Category: AI

  • Talk like a graph: Encoding graphs for large language models

    Google AI

    Imagine all the things around you — your friends, tools in your kitchen, or even the parts of your bike. They are all connected in different ways. In computer science, the term graph is used to describe connections between objects. Graphs consist of nodes (the objects themselves) and edges (connections between two nodes, indicating a relationship between them). Graphs are everywhere now. The internet itself is a giant graph of websites linked together. Even the knowledge search engines use is organized in a graph-like way.

    Furthermore, consider the remarkable advancements in artificial intelligence — such as chatbots that can write stories in seconds, and even software that can interpret medical reports. This exciting progress is largely thanks to large language models (LLMs). New LLM technology is constantly being developed for different uses.

Since graphs are everywhere and LLM technology is on the rise, in “Talk like a Graph: Encoding Graphs for Large Language Models”, presented at ICLR 2024, we present a way to teach powerful LLMs how to better reason with graph information. Graphs are a useful way to organize information, but LLMs are mostly trained on regular text, so translating graphs into text that LLMs can understand is a remarkably complex task. The difficulty stems from the inherent complexity of graph structures: many nodes connected by an intricate web of edges. The objective is to test different techniques to see what works best and to gain practical insights. Our work studies how to take a graph and translate it into a format that an LLM can understand. We also design a benchmark called GraphQA to study different approaches on different graph reasoning problems and to show how to phrase a graph-related problem in a way that enables the LLM to solve it. We show that LLM performance on graph reasoning tasks varies along three fundamental axes: 1) the graph encoding method, 2) the nature of the graph task itself, and 3) interestingly, the very structure of the graph considered. These findings give us clues on how to best represent graphs for LLMs. Picking the right method can make the LLM up to 60% better at graph tasks!

    Pictured, the process of encoding a graph as text using two different approaches and feeding the text and a question about the graph to the LLM.

    Graphs as text

To systematically find the best way to translate a graph to text, we first design a benchmark called GraphQA. Think of GraphQA as an exam designed to evaluate powerful LLMs on graph-specific problems. We want to see how well LLMs can understand and solve problems that involve graphs in different setups. To create a comprehensive and realistic exam, we don’t use just one type of graph; we use a mix of graphs spanning a wide range of connectivity, mainly because different graph types make such problems easier or harder to solve. This way, GraphQA can help expose biases in how an LLM thinks about graphs, and the whole exam gets closer to the realistic setups that LLMs might encounter in the real world.

    Overview of our framework for reasoning with graphs using LLMs.

GraphQA focuses on simple tasks related to graphs, like checking if an edge exists, calculating the number of nodes or edges, finding nodes that are connected to a specific node, and checking for cycles in a graph. These tasks might seem basic, but they require understanding the relationships between nodes and edges. By covering different types of challenges, from identifying patterns to creating new connections, GraphQA helps models learn how to analyze graphs effectively. These basic tasks are crucial for more complex reasoning on graphs, like finding the shortest path between nodes, detecting communities, or identifying influential nodes. Additionally, GraphQA includes generating random graphs using various algorithms like Erdős–Rényi, scale-free networks, the Barabási–Albert model, and the stochastic block model, as well as simpler graph structures like paths, complete graphs, and star graphs, providing a diverse set of data for training.
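The benchmark's generation code is not reproduced in this post, but a rough sketch of how such graphs could be sampled with the networkx library looks like the following. The sizes and probabilities are illustrative assumptions, not GraphQA's actual settings.

# Illustrative only: sampling small graphs in the spirit of GraphQA's generators.
import networkx as nx

n = 12  # number of nodes

graphs = {
    "er": nx.erdos_renyi_graph(n, p=0.3),                                  # Erdős–Rényi
    "ba": nx.barabasi_albert_graph(n, m=2),                                # Barabási–Albert (scale-free)
    "sbm": nx.stochastic_block_model([6, 6], [[0.6, 0.05], [0.05, 0.6]]),  # stochastic block model
    "path": nx.path_graph(n),
    "complete": nx.complete_graph(n),
    "star": nx.star_graph(n - 1),
}

for name, g in graphs.items():
    print(name, g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")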

When working with graphs, we also need to find ways to ask graph-related questions that LLMs can understand. Prompting heuristics are different strategies for doing this; a small sketch of how such prompts might be assembled follows the list. Let’s break down the common ones:

• Zero-shot: Simply describe the task (“Is there a cycle in this graph?”) and tell the LLM to go for it. No examples are provided.
    • Few-shot: This is like giving the LLM a mini practice test before the real deal. We provide a few example graph questions and their correct answers.
    • Chain-of-Thought: Here, we show the LLM how to break down a problem step-by-step with examples. The goal is to teach it to generate its own “thought process” when faced with new graphs.
    • Zero-CoT: Similar to CoT, but instead of training examples, we give the LLM a simple prompt, like “Let’s think step-by-step,” to trigger its own problem-solving breakdown.
    • BAG (build a graph): This is specifically for graph tasks. We add the phrase “Let’s build a graph…” to the description, helping the LLM focus on the graph structure.
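Here is the small sketch mentioned above: a rough illustration of how these prompt styles could be assembled around a graph description and a question. The exact wording of the paper's prompts may differ; the templates below are assumptions for illustration only.

# Illustrative prompt builders for the zero-shot, zero-CoT, and few-shot heuristics.
GRAPH_TEXT = "In an undirected graph, the nodes are: 0, 1, 2, 3. The edges are: (0, 1), (1, 2), (2, 0)."
QUESTION = "Q: Is there a cycle in this graph?"

def zero_shot(graph_text, question):
    return f"{graph_text}\n{question}\nA:"

def zero_cot(graph_text, question):
    # Appends the trigger phrase instead of providing worked examples.
    return f"{graph_text}\n{question}\nA: Let's think step by step."

def few_shot(examples, graph_text, question):
    # Each example is a (graph description, question, answer) triple; for
    # chain-of-thought, the answer would also spell out intermediate reasoning.
    shots = "\n\n".join(f"{g}\n{q}\nA: {a}" for g, q, a in examples)
    return f"{shots}\n\n{graph_text}\n{question}\nA:"

examples = [
    ("In an undirected graph, the nodes are: 0, 1, 2. The edges are: (0, 1), (1, 2), (2, 0).",
     "Q: Is there a cycle in this graph?", "Yes."),
    ("In an undirected graph, the nodes are: 0, 1, 2. The edges are: (0, 1), (1, 2).",
     "Q: Is there a cycle in this graph?", "No."),
]
print(few_shot(examples, GRAPH_TEXT, QUESTION))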

    We explored different ways to translate graphs into text that LLMs can work with. Our key questions were:

    • Node encoding: How do we represent individual nodes? Options tested include simple integers, common names (people, characters), and letters.
    • Edge encoding: How do we describe the relationships between nodes? Methods involved parenthesis notation, phrases like “are friends”, and symbolic representations like arrows.

    Various node and edge encodings were combined systematically. This led to functions like the ones in the following figure:

    Examples of graph encoding functions used to encode graphs via text.
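As a rough illustration of what such encoding functions could look like, here are two hypothetical encoders: one uses integer node names and parenthesis edge notation, the other uses people's names and an “are friends” phrasing. The templates are written in the spirit of the paper but are assumptions, not its exact encoders.

# Illustrative graph-to-text encoders; the paper's exact templates may differ.
import networkx as nx

def encode_integer(g):
    # Integer node names, parenthesis-style edge notation.
    nodes = ", ".join(str(n) for n in g.nodes)
    edges = ", ".join(f"({u}, {v})" for u, v in g.edges)
    return f"In an undirected graph, the nodes are: {nodes}. The edges are: {edges}."

def encode_friendship(g, names):
    # Common first names for nodes, "are friends" phrasing for edges.
    sentences = [f"{names[u]} and {names[v]} are friends." for u, v in g.edges]
    return "G describes a friendship graph. " + " ".join(sentences)

g = nx.cycle_graph(4)
print(encode_integer(g))
print(encode_friendship(g, ["Alice", "Bob", "Carol", "David"]))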

    Analysis and results

We carried out three key experiments: one to test how LLMs handle graph tasks, and two to understand how the size of the LLM and different graph shapes affect performance. We ran all our experiments on GraphQA.

    How LLMs handle graph tasks

    In this experiment, we tested how well pre-trained LLMs tackle graph problems like identifying connections, cycles, and node degrees. Here is what we learned:

    • LLMs struggle: On most of these basic tasks, LLMs did not do much better than a random guess.
• Encoding matters significantly: How we represent the graph as text has a large effect on LLM performance. The “incident” encoding excelled across most tasks.

    Our results are summarized in the following chart.

    Comparison of various graph encoder functions based on their accuracy on different graph tasks. The main conclusion from this figure is that the graph encoding functions matter significantly.

    Bigger is (usually) better

    In this experiment, we wanted to see if the size of the LLM (in terms of the number of parameters) affects how well they can handle graph problems. For that, we tested the same graph tasks on the XXS, XS, S, and L sizes of PaLM 2. Here is a summary of our findings:

    • In general, bigger models did better on graph reasoning tasks. It seems like the extra parameters gave them space to learn more complex patterns.
    • Oddly, size didn’t matter as much for the “edge existence” task (finding out if two nodes in a graph are connected).
• Even the biggest LLM couldn’t consistently beat a simple baseline solution on the cycle check problem (finding out if a graph contains a cycle or not). This shows LLMs still have room to improve on certain graph tasks.

Effect of model capacity on graph reasoning tasks for PaLM 2-XXS, XS, S, and L.

Do different graph shapes confuse LLMs?

    We wondered if the “shape” of a graph (how nodes are connected) influences how well LLMs can solve problems on it. Think of the following figure as different examples of graph shapes.

Samples of graphs generated with different graph generators from GraphQA. ER, BA, SBM, and SFN refer to Erdős–Rényi, Barabási–Albert, Stochastic Block Model, and Scale-Free Network, respectively.

We found that graph structure has a big impact on LLM performance. For example, in a task asking if a cycle exists, LLMs did great on tightly interconnected graphs (where cycles are common) but struggled on path graphs (where cycles never occur). Interestingly, providing some mixed examples helped them adapt: for cycle check, we added some examples containing a cycle and some examples without cycles as few-shot examples in our prompt. Similar patterns occurred with other tasks.

Comparing different graph generators on different graph tasks. The main observation here is that graph structure has a significant impact on the LLM’s performance. ER, BA, SBM, and SFN refer to Erdős–Rényi, Barabási–Albert, Stochastic Block Model, and Scale-Free Network, respectively.

    Conclusion

    In short, we dug deep into how to best represent graphs as text so LLMs can understand them. We found three major factors that make a difference:

• How to translate the graph to text: How we represent the graph as text significantly influences LLM performance. The incident encoding excelled across most tasks.
    • Task type: Certain types of graph questions tend to be harder for LLMs, even with a good translation from graph to text.
• Graph structure: Surprisingly, the “shape” of the graph on which we do inference (dense with connections, sparse, etc.) influences how well an LLM does.

    This study revealed key insights about how to prepare graphs for LLMs. The right encoding techniques can significantly boost an LLM’s accuracy on graph problems (ranging from around 5% to over 60% improvement). Our new benchmark, GraphQA, will help drive further research in this area.

    Acknowledgements

    We would like to express our gratitude to our co-author, Jonathan Halcrow, for his valuable contributions to this work. We express our sincere gratitude to Anton Tsitsulin, Dustin Zelle, Silvio Lattanzi, Vahab Mirrokni, and the entire graph mining team at Google Research, for their insightful comments, thorough proofreading, and constructive feedback which greatly enhanced the quality of our work. We would also like to extend special thanks to Tom Small for creating the animation used in this post.

    Originally appeared here:
    Talk like a graph: Encoding graphs for large language models


  • Human Pose Tracking with MediaPipe in 2D and 3D: Rerun Showcase

    Andreas Naoum

    How to easily visualise MediaPipe’s human pose tracking with Rerun

    Human Pose Tracking | Image by Author

    Overview

    We explore a use case that leverages the power of MediaPipe for tracking human poses in both 2D and 3D. What makes this exploration even more fascinating is the visualisation aspect powered by the open-source visualisation tool Rerun, which provides a holistic view of human poses in action.

    In this blog post, you’ll be guided to use MediaPipe to track human poses in 2D and 3D, and explore the visualisation capabilities of Rerun.

    Human Pose Tracking

    Human pose tracking is a task in computer vision that focuses on identifying key body locations, analysing posture, and categorising movements. At the heart of this technology is a pre-trained machine-learning model to assess the visual input and recognise landmarks on the body in both image coordinates and 3D world coordinates. The use cases and applications of this technology include but are not limited to Human-Computer Interaction, Sports Analysis, Gaming, Virtual Reality, Augmented Reality, Health, etc.

    It would be good to have a perfect model, but unfortunately, the current models are still imperfect. Although datasets could have a variety of body types, the human body differs among individuals. The uniqueness of each individual’s body poses a challenge, particularly for those with non-standard arm and leg dimensions, which may result in lower accuracy when using this technology. When considering the integration of this technology into systems, it is crucial to acknowledge the possibility of inaccuracies. Hopefully, ongoing efforts within the scientific community will pave the way for the development of more robust models.

    Beyond lack of accuracy, ethical and legal considerations emerge from utilising this technology. For instance, capturing human body poses in public spaces could potentially invade privacy rights if individuals have not given their consent. It’s crucial to take into account any ethical and legal concerns before implementing this technology in real-world scenarios.

    Prerequisites & Setup

    Begin by installing the required libraries:

# Install the required Python packages
pip install mediapipe
pip install numpy
pip install "opencv-python<4.6"
pip install "requests>=2.31,<3"
pip install rerun-sdk

# or just use the requirements file
pip install -r examples/python/human_pose_tracking/requirements.txt

    Track Human Pose using MediaPipe

    Image via Pose Landmark Detection Guide by Google [1]

    MediaPipe Python is a handy tool for developers looking to integrate on-device ML solutions for computer vision and machine learning.

    In the code below, MediaPipe pose landmark detection was utilised for detecting landmarks of human bodies in an image. This model can detect body pose landmarks as both image coordinates and 3D world coordinates. Once you have successfully run the ML model, you can use the image coordinates and the 3D world coordinates to visualise the output.

import mediapipe as mp
import numpy as np
from typing import Any
import numpy.typing as npt
import cv2


def read_landmark_positions_2d(
    results: Any,
    image_width: int,
    image_height: int,
) -> npt.NDArray[np.float32] | None:
    """
    Read 2D landmark positions from Mediapipe Pose results.

    Args:
        results (Any): Mediapipe Pose results.
        image_width (int): Width of the input image.
        image_height (int): Height of the input image.

    Returns:
        np.array | None: Array of 2D landmark positions or None if no landmarks are detected.
    """
    if results.pose_landmarks is None:
        return None
    else:
        # Extract normalized landmark positions and scale them to image dimensions
        normalized_landmarks = [results.pose_landmarks.landmark[lm] for lm in mp.solutions.pose.PoseLandmark]
        return np.array([(image_width * lm.x, image_height * lm.y) for lm in normalized_landmarks])


def read_landmark_positions_3d(
    results: Any,
) -> npt.NDArray[np.float32] | None:
    """
    Read 3D landmark positions from Mediapipe Pose results.

    Args:
        results (Any): Mediapipe Pose results.

    Returns:
        np.array | None: Array of 3D landmark positions or None if no landmarks are detected.
    """
    if results.pose_landmarks is None:
        return None
    else:
        # Extract 3D landmark positions
        landmarks = [results.pose_world_landmarks.landmark[lm] for lm in mp.solutions.pose.PoseLandmark]
        return np.array([(lm.x, lm.y, lm.z) for lm in landmarks])


def track_pose(image_path: str) -> None:
    """
    Track and analyze pose from an input image.

    Args:
        image_path (str): Path to the input image.
    """
    # Read the image, convert color to RGB
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Create a Pose model instance
    pose_detector = mp.solutions.pose.Pose(static_image_mode=True)

    # Process the image to obtain pose landmarks
    results = pose_detector.process(image)
    h, w, _ = image.shape

    # Read 2D and 3D landmark positions
    landmark_positions_2d = read_landmark_positions_2d(results, w, h)
    landmark_positions_3d = read_landmark_positions_3d(results)

    Visualise the output of MediaPipe using Rerun

    Rerun Viewer | Image via Rerun Docs [2]

Rerun serves as a visualisation tool for multimodal data. Through the Rerun Viewer, you can build layouts, customise visualisations, and interact with your data. The rest of this section details how you can log and present data using the Rerun SDK to visualise it within the Rerun Viewer.
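The snippets that follow assume the Rerun SDK has already been initialised and a viewer spawned, and that mp_pose is shorthand for MediaPipe's pose module. A minimal sketch of that setup (the application id is just an illustrative name):

import rerun as rr
import mediapipe as mp

# Initialise the Rerun SDK and spawn a local viewer to stream the logged data to.
rr.init("human_pose_tracking", spawn=True)

# Shorthand used in the snippets below.
mp_pose = mp.solutions.pose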

    Pose Landmarker Model | Image via Pose Landmark Detection Guide by Google [1]

For both 2D and 3D points, specifying the connections between them is essential; defining these connections automatically renders lines between them. Using the information provided by MediaPipe, you can get the pose point connections from the POSE_CONNECTIONS set and then set them as keypoint connections using an Annotation Context.

rr.log(
    "/",
    rr.AnnotationContext(
        rr.ClassDescription(
            info=rr.AnnotationInfo(id=0, label="Person"),
            keypoint_annotations=[rr.AnnotationInfo(id=lm.value, label=lm.name) for lm in mp_pose.PoseLandmark],
            keypoint_connections=mp_pose.POSE_CONNECTIONS,
        )
    ),
    timeless=True,
)

    Image Coordinates — 2D Positions

    Visualising Human Pose as 2D Points | Image by Author

Visualising the body pose landmarks on the video appears to be a good choice. To achieve that, you need to follow the Rerun documentation for Entities and Components. The Entity Path Hierarchy page describes how to log multiple Components on the same Entity. For example, you can create the ‘video’ entity and include the components ‘video/rgb’ for the video and ‘video/pose’ for the body pose. To use this on a video, you also need the concept of Timelines, so that each frame can be associated with the appropriate data.

    Here is a function that can visualise the 2D points on the video:

from contextlib import closing  # VideoSource is a small helper class from the example's source code

def track_pose_2d(video_path: str, max_frame_count: int | None = None) -> None:
    mp_pose = mp.solutions.pose

    with closing(VideoSource(video_path)) as video_source, mp_pose.Pose() as pose:
        for idx, bgr_frame in enumerate(video_source.stream_bgr()):
            if max_frame_count is not None and idx >= max_frame_count:
                break

            rgb = cv2.cvtColor(bgr_frame.data, cv2.COLOR_BGR2RGB)

            # Associate frame with the data
            rr.set_time_seconds("time", bgr_frame.time)
            rr.set_time_sequence("frame_idx", bgr_frame.idx)

            # Present the video
            rr.log("video/rgb", rr.Image(rgb).compress(jpeg_quality=75))

            # Get the prediction results
            results = pose.process(rgb)
            h, w, _ = rgb.shape

            # Log 2d points to 'video' entity
            landmark_positions_2d = read_landmark_positions_2d(results, w, h)
            if landmark_positions_2d is not None:
                rr.log(
                    "video/pose/points",
                    rr.Points2D(landmark_positions_2d, class_ids=0, keypoint_ids=mp_pose.PoseLandmark),
                )

    3D World Coordinates — 3D Points

    Visualising Human Pose as 3D Points | Image by Author

Why settle for 2D points when you can have 3D points? Create a new entity, name it “Person”, and log the 3D points. That’s it! You’ve just created a 3D presentation of the human body pose.

    Here is how to do it:

def track_pose_3d(video_path: str, *, segment: bool, max_frame_count: int | None) -> None:
    mp_pose = mp.solutions.pose

    rr.log("person", rr.ViewCoordinates.RIGHT_HAND_Y_DOWN, timeless=True)

    with closing(VideoSource(video_path)) as video_source, mp_pose.Pose() as pose:
        for idx, bgr_frame in enumerate(video_source.stream_bgr()):
            if max_frame_count is not None and idx >= max_frame_count:
                break

            rgb = cv2.cvtColor(bgr_frame.data, cv2.COLOR_BGR2RGB)

            # Associate frame with the data
            rr.set_time_seconds("time", bgr_frame.time)
            rr.set_time_sequence("frame_idx", bgr_frame.idx)

            # Present the video
            rr.log("video/rgb", rr.Image(rgb).compress(jpeg_quality=75))

            # Get the prediction results
            results = pose.process(rgb)
            h, w, _ = rgb.shape

            # New entity "Person" for the 3D presentation
            landmark_positions_3d = read_landmark_positions_3d(results)
            if landmark_positions_3d is not None:
                rr.log(
                    "person/pose/points",
                    rr.Points3D(landmark_positions_3d, class_ids=0, keypoint_ids=mp_pose.PoseLandmark),
                )
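To run this end to end, you would call the function after initialising Rerun as sketched earlier; the video path and frame limit below are placeholders, not files shipped with the example:

# Hypothetical invocation; replace the path with a real video file.
track_pose_3d("path/to/video.mp4", segment=False, max_frame_count=200)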

    Source Code Exploration

    The tutorial focuses on the main parts of the Human Pose Tracking example. For those who prefer a hands-on approach, the full source code for this example is available on GitHub. Feel free to explore, modify, and understand the inner workings of the implementation.

    Tips & Suggestions

    1. Compress the image for efficiency

    You can boost the overall procedure speed by compressing the logged images:

rr.log(
    "video",
    rr.Image(img).compress(jpeg_quality=75)
)

    2. Limit Memory Use

If you’re logging more data than fits in your RAM, Rerun will start dropping the oldest data. The default limit is 75% of your system RAM. If you want to change that, you can use the --memory-limit command line argument. More information about memory limits can be found on Rerun’s How To Limit Memory Use page.

    3. Customise Visualisations for your needs

    Customise Rerun Viewer | Image by Author

    Beyond Human Pose Tracking

    If you found this article useful and insightful, there’s more!

    Similar articles:

    Real-Time Hand Tracking and Gesture Recognition with MediaPipe: Rerun Showcase

    I regularly share tutorials on visualisation for computer vision and robotics. Follow me for future updates!

    Also, you can find me on LinkedIn.

    Sources

    [1] Pose Landmark Detection Guide by Google, Portions of this page are reproduced from work created and shared by Google and used according to terms described in the Creative Commons 4.0 Attribution License.

    [2] Rerun Docs by Rerun under MIT license


    Human Pose Tracking with MediaPipe in 2D and 3D: Rerun Showcase was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Human Pose Tracking with MediaPipe in 2D and 3D: Rerun Showcase


  • Run an audience overlap analysis in AWS Clean Rooms

    Eric Saccullo

    In this post, we explore what an audience overlap analysis is, discuss the current technical approaches and their challenges, and illustrate how you can run secure audience overlap analysis using AWS Clean Rooms.

    Originally appeared here:
    Run an audience overlap analysis in AWS Clean Rooms


  • Human Pose Tracking with MediaPipe in 2D and 3D: Rerun Showcase

    Andreas Naoum

    How to easily visualise MediaPipe’s human pose tracking with Rerun

    Human Pose Tracking | Image by Author

    Overview

    We explore a use case that leverages the power of MediaPipe for tracking human poses in both 2D and 3D. What makes this exploration even more fascinating is the visualisation aspect powered by the open-source visualisation tool Rerun, which provides a holistic view of human poses in action.

    In this blog post, you’ll be guided to use MediaPipe to track human poses in 2D and 3D, and explore the visualisation capabilities of Rerun.

    Human Pose Tracking

    Human pose tracking is a task in computer vision that focuses on identifying key body locations, analysing posture, and categorising movements. At the heart of this technology is a pre-trained machine-learning model to assess the visual input and recognise landmarks on the body in both image coordinates and 3D world coordinates. The use cases and applications of this technology include but are not limited to Human-Computer Interaction, Sports Analysis, Gaming, Virtual Reality, Augmented Reality, Health, etc.

    It would be good to have a perfect model, but unfortunately, the current models are still imperfect. Although datasets could have a variety of body types, the human body differs among individuals. The uniqueness of each individual’s body poses a challenge, particularly for those with non-standard arm and leg dimensions, which may result in lower accuracy when using this technology. When considering the integration of this technology into systems, it is crucial to acknowledge the possibility of inaccuracies. Hopefully, ongoing efforts within the scientific community will pave the way for the development of more robust models.

    Beyond lack of accuracy, ethical and legal considerations emerge from utilising this technology. For instance, capturing human body poses in public spaces could potentially invade privacy rights if individuals have not given their consent. It’s crucial to take into account any ethical and legal concerns before implementing this technology in real-world scenarios.

    Prerequisites & Setup

    Begin by installing the required libraries:

    # Install the required Python packages 
    pip install mediapipe
    pip install numpy
    pip install opencv-python<4.6
    pip install requests>=2.31,<3
    pip install rerun-sdk

    # or just use the requirements file
    pip install -r examples/python/human_pose_tracking/requirements.txt

    Track Human Pose using MediaPipe

    Image via Pose Landmark Detection Guide by Google [1]

    MediaPipe Python is a handy tool for developers looking to integrate on-device ML solutions for computer vision and machine learning.

    In the code below, MediaPipe pose landmark detection was utilised for detecting landmarks of human bodies in an image. This model can detect body pose landmarks as both image coordinates and 3D world coordinates. Once you have successfully run the ML model, you can use the image coordinates and the 3D world coordinates to visualise the output.

    import mediapipe as mp
    import numpy as np
    from typing import Any
    import numpy.typing as npt
    import cv2


    """
    Read 2D landmark positions from Mediapipe Pose results.

    Args:
    results (Any): Mediapipe Pose results.
    image_width (int): Width of the input image.
    image_height (int): Height of the input image.

    Returns:
    np.array | None: Array of 2D landmark positions or None if no landmarks are detected.
    """
    def read_landmark_positions_2d(
    results: Any,
    image_width: int,
    image_height: int,
    ) -> npt.NDArray[np.float32] | None:
    if results.pose_landmarks is None:
    return None
    else:
    # Extract normalized landmark positions and scale them to image dimensions
    normalized_landmarks = [results.pose_landmarks.landmark[lm] for lm in mp.solutions.pose.PoseLandmark]
    return np.array([(image_width * lm.x, image_height * lm.y) for lm in normalized_landmarks])


    """
    Read 3D landmark positions from Mediapipe Pose results.

    Args:
    results (Any): Mediapipe Pose results.

    Returns:
    np.array | None: Array of 3D landmark positions or None if no landmarks are detected.
    """
    def read_landmark_positions_3d(
    results: Any,
    ) -> npt.NDArray[np.float32] | None:
    if results.pose_landmarks is None:
    return None
    else:
    # Extract 3D landmark positions
    landmarks = [results.pose_world_landmarks.landmark[lm] for lm in mp.solutions.pose.PoseLandmark]
    return np.array([(lm.x, lm.y, lm.z) for lm in landmarks])


    """
    Track and analyze pose from an input image.

    Args:
    image_path (str): Path to the input image.
    """
    def track_pose(image_path: str) -> None:
    # Read the image, convert color to RGB
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Create a Pose model instance
    pose_detector = mp.solutions.pose.Pose(static_image_mode=True)

    # Process the image to obtain pose landmarks
    results = pose_detector.process(image)
    h, w, _ = image.shape

    # Read 2D and 3D landmark positions
    landmark_positions_2d = read_landmark_positions_2d(results, w, h)
    landmark_positions_3d = read_landmark_positions_3d(results)

    Visualise the output of MediaPipe using Rerun

    Rerun Viewer | Image via Rerun Docs [2]

    Rerun serves as a visualisation tool for multi-modal data. Through the Rerun Viewer, you can build layouts, customise visualisations and interact with your data. The rest part of this section details how you can log and present data using the Rerun SDK to visualise it within the Rerun Viewer

    Pose Landmarker Model | Image via Pose Landmark Detection Guide by Google [1]

    In both 2D and 3D points, specifying connections between points is essential. Defining these connections automatically renders lines between them. Using the information provided by MediaPipe, you can get the pose points connections from the POSE_CONNECTIONS set and then set them as keypoint connections using Annotation Context.

    rr.log(
    "/",
    rr.AnnotationContext(
    rr.ClassDescription(
    info=rr.AnnotationInfo(id=0, label="Person"),
    keypoint_annotations=[rr.AnnotationInfo(id=lm.value, label=lm.name) for lm in mp_pose.PoseLandmark],
    keypoint_connections=mp_pose.POSE_CONNECTIONS,
    )
    ),
    timeless=True,
    )

    Image Coordinates — 2D Positions

    Visualising Human Pose as 2D Points | Image by Author

    Visualising the body pose landmarks on the video appears to be a good choice. To achieve that, you need to follow the rerun documentation for Entities and Components. The Entity Path Hierarchy page describes how to log multiple Components on the same Entity. For example, you can create the ‘video’ entity and include the components ‘video/rgb’ for the video and ‘video/pose’ for the body pose. If you’re aiming to use that for a video, you need the concept of Timelines. Each frame can be associated with the appropriate data.

    Here is a function that can visualise the 2D points on the video:

    def track_pose_2d(video_path: str) -> None:
    mp_pose = mp.solutions.pose

    with closing(VideoSource(video_path)) as video_source, mp_pose.Pose() as pose:
    for idx, bgr_frame in enumerate(video_source.stream_bgr()):
    if max_frame_count is not None and idx >= max_frame_count:
    break

    rgb = cv2.cvtColor(bgr_frame.data, cv2.COLOR_BGR2RGB)

    # Associate frame with the data
    rr.set_time_seconds("time", bgr_frame.time)
    rr.set_time_sequence("frame_idx", bgr_frame.idx)

    # Present the video
    rr.log("video/rgb", rr.Image(rgb).compress(jpeg_quality=75))

    # Get the prediction results
    results = pose.process(rgb)
    h, w, _ = rgb.shape

    # Log 2d points to 'video' entity
    landmark_positions_2d = read_landmark_positions_2d(results, w, h)
    if landmark_positions_2d is not None:
    rr.log(
    "video/pose/points",
    rr.Points2D(landmark_positions_2d, class_ids=0, keypoint_ids=mp_pose.PoseLandmark),
    )

    3D World Coordinates — 3D Points

    Visualising Human Pose as 3D Points | Image by Author

    Why settle on 2D points when you have 3D Points? Create a new entity, name it “Person”, and log the 3D points. It’s done! You just created a 3D presentation of the human body pose.

    Here is how to do it:

    def track_pose_3d(video_path: str, *, segment: bool, max_frame_count: int | None) -> None:
    mp_pose = mp.solutions.pose

    rr.log("person", rr.ViewCoordinates.RIGHT_HAND_Y_DOWN, timeless=True)

    with closing(VideoSource(video_path)) as video_source, mp_pose.Pose() as pose:
    for idx, bgr_frame in enumerate(video_source.stream_bgr()):
    if max_frame_count is not None and idx >= max_frame_count:
    break

    rgb = cv2.cvtColor(bgr_frame.data, cv2.COLOR_BGR2RGB)

    # Associate frame with the data
    rr.set_time_seconds("time", bgr_frame.time)
    rr.set_time_sequence("frame_idx", bgr_frame.idx)

    # Present the video
    rr.log("video/rgb", rr.Image(rgb).compress(jpeg_quality=75))

    # Get the prediction results
    results = pose.process(rgb)
    h, w, _ = rgb.shape

    # New entity "Person" for the 3D presentation
    landmark_positions_3d = read_landmark_positions_3d(results)
    if landmark_positions_3d is not None:
    rr.log(
    "person/pose/points",
    rr.Points3D(landmark_positions_3d, class_ids=0, keypoint_ids=mp_pose.PoseLandmark),
    )

    Source Code Exploration

    The tutorial focuses on the main parts of the Human Pose Tracking example. For those who prefer a hands-on approach, the full source code for this example is available on GitHub. Feel free to explore, modify, and understand the inner workings of the implementation.

    Tips & Suggestions

    1. Compress the image for efficiency

    You can boost the overall procedure speed by compressing the logged images:

    rr.log(
    "video",
    rr.Image(img).compress(jpeg_quality=75)
    )

    2. Limit Memory Use

    If you’re logging more data than can be fitted into your RAM, it will start dropping the old data. The default limit is 75% of your system RAM. If you want to increase that you could use the command line argument — memory-limit. More information about memory limits can be found on Rerun’s How To Limit Memory Use page.

    3. Customise Visualisations for your needs

    Customise Rerun Viewer | Image by Author

    Beyond Human Pose Tracking

    If you found this article useful and insightful, there’s more!

    Similar articles:

    Real-Time Hand Tracking and Gesture Recognition with MediaPipe: Rerun Showcase

    I regularly share tutorials on visualisation for computer vision and robotics. Follow me for future updates!

    Also, you can find me on LinkedIn.

    Sources

[1] Pose Landmark Detection Guide by Google. Portions of this page are reproduced from work created and shared by Google and used according to terms described in the Creative Commons 4.0 Attribution License.

    [2] Rerun Docs by Rerun under MIT license


    Human Pose Tracking with MediaPipe in 2D and 3D: Rerun Showcase was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Human Pose Tracking with MediaPipe in 2D and 3D: Rerun Showcase

    Go Here to Read this Fast! Human Pose Tracking with MediaPipe in 2D and 3D: Rerun Showcase

  • Large language model inference over confidential data using AWS Nitro Enclaves

    Large language model inference over confidential data using AWS Nitro Enclaves

    Chris Renzo

    This post is co-written with Justin Miles, Liv d’Aliberti, and Joe Kovba from Leidos.  Leidos is a Fortune 500 science and technology solutions leader working to address some of the world’s toughest challenges in the defense, intelligence, homeland security, civil, and healthcare markets. In this post, we discuss how Leidos worked with AWS to develop an […]

    Originally appeared here:
    Large language model inference over confidential data using AWS Nitro Enclaves

    Go Here to Read this Fast! Large language model inference over confidential data using AWS Nitro Enclaves
