Is AI too "stupid" to tell objects apart? 3D point clouds + GANs make robot "eyes" sharper!

liu, tempo Date: 2021-08-10 10:07:40

With the continuous development of AI and robotics, all kinds of "AI robots" now assist in people's lives: from space robots supporting missions in orbit, to small home cleaning robots that free our hands. The roles robots play in human life are becoming more and more diverse.


But did you know? The visual acuity of robots currently used for indoor tasks, especially those that require frequent interaction with the environment, still needs improvement: many robots cannot discern the subtle differences between similar objects.


Recently, a team of researchers from the University of Texas at Arlington (UTA) proposed a method called PCGAN. The researchers say it is the first conditional generative adversarial network (GAN) to generate 3D point clouds in an unsupervised manner. PCGAN can produce 3D color point clouds at multiple resolutions with fine details, yielding discriminative images of objects, which could greatly improve the visual acuity of robots. Without further ado, let's start with the images.




What if the images are not realistic enough?


Imagine how a floor-cleaning robot works at home. Generally speaking, such a robot must interact with its environment to complete navigation tasks, which requires it to sense its surroundings and make real-time decisions about how to interact with them.


For robots to have this capacity for self-determination, scientists use methods such as machine learning and deep learning: large sets of collected images serve as training data to teach the robot to respond correctly to different objects and environments.


To obtain such data, some people collect images manually, for example by capturing the interior of a house with an expensive 360-degree panoramic camera, or by taking partial shots and stitching them into a panorama with software. It is obvious, however, that manual capture is far too inefficient to meet training requirements that demand large amounts of data.


On the other hand, although millions of indoor photos and videos are available, almost none were taken from a vantage point like a robot vacuum's. Trying to train the robot on images shot from a human-centered perspective is therefore inadvisable.


So the team turned to a deep learning approach called generative adversarial networks to create images realistic enough to train a robot to improve its ability to discern its environment.


As a type of generative model, a GAN consists of two neural networks: a generator and a discriminator. The generator continuously produces fake images, while the discriminator judges whether each image is real or fake. The two networks compete with each other, and the generator ends up with a very strong ability to produce realistic samples. Once trained, such a network can create countless possible indoor or outdoor environments containing objects such as tables, chairs, or vehicles. The differences between generated objects can be very small, yet the images still carry dimensions and features recognizable to both humans and robots.
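The adversarial loop described above can be sketched in a few lines. The toy example below is not the paper's PCGAN; it is a minimal 1-D GAN, assumed here purely for illustration, with an affine generator G(z) = w·z + b and a logistic-regression discriminator D(x) = sigmoid(a·x + c), trained with hand-derived gradients against a Gaussian "real" distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# "Real" data the generator should learn to imitate: samples from N(4, 1).
def real_batch(n):
    return rng.normal(4.0, 1.0, n)

w, b = 1.0, 0.0   # generator parameters
a, c = 0.1, 0.0   # discriminator parameters
lr, n = 0.05, 64

for step in range(2000):
    z = rng.normal(0.0, 1.0, n)   # latent noise
    xf = w * z + b                # fake samples G(z)
    xr = real_batch(n)

    # Discriminator update: push D(real) -> 1 and D(fake) -> 0.
    dr, df = sigmoid(a * xr + c), sigmoid(a * xf + c)
    g_logit_r = -(1.0 - dr)       # dLoss/dlogit on real samples
    g_logit_f = df                # dLoss/dlogit on fake samples
    a -= lr * (np.mean(g_logit_r * xr) + np.mean(g_logit_f * xf))
    c -= lr * (np.mean(g_logit_r) + np.mean(g_logit_f))

    # Generator update (non-saturating loss): push D(G(z)) -> 1.
    df = sigmoid(a * xf + c)
    g_x = -(1.0 - df) * a         # dLoss/dx on fake samples
    w -= lr * np.mean(g_x * z)
    b -= lr * np.mean(g_x)

# After training, the generator's samples should resemble the real data.
fake = w * rng.normal(0.0, 1.0, 1000) + b
print("generated mean ~", fake.mean(), "std ~", fake.std())
```

PCGAN plays the same generator-versus-discriminator game, only with deep networks producing colored 3D point clouds instead of a two-parameter map producing scalars.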


PCGAN: sharper 3D point cloud images


The entire research team consists of William Beksi, an assistant professor in the Department of Computer Science and Engineering at UTA, and six of his doctoral students. Mohammad Samiul Arshad, a doctoral student involved in the study, said, “Designing these objects manually would be resource- and labor-intensive, whereas with proper training, a generative network could do the same task in seconds.”

The image data in this study are represented as 3D point clouds, a form of object representation produced by a 3D scanner. A point cloud records an object as a set of points, each containing 3D coordinates, intensity information (which reflects the target object's material, roughness, angle of incidence, and so on), and possibly color information (RGB).
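In memory, such a cloud is commonly laid out as one row per point holding those fields. The layout below is illustrative (it does not correspond to any specific file format used in the study):

```python
import numpy as np

# One row per point: XYZ coordinates, return intensity, RGB color.
points = np.array([
    # x,    y,    z,    intensity, r,   g,   b
    [0.12, 0.40, 1.05, 0.82,      200, 180, 150],
    [0.15, 0.41, 1.07, 0.79,      198, 182, 149],
    [0.90, 0.10, 0.30, 0.35,       60,  60,  70],
])

xyz       = points[:, 0:3]   # 3D coordinates
intensity = points[:, 3]     # reflectance of the scanned surface
rgb       = points[:, 4:7]   # per-point color

print(points.shape)  # (3, 7): three points, seven channels each
```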


In this regard, Beksi explains: “We can move them to new locations, even using different lights, colors and textures, and render them as training images that can be used in the dataset. This approach could potentially provide an unlimited amount of data to train the robot.”


In their experiments, the researchers used ShapeNetCore, a collection of CAD models covering various object classes, as their dataset. They chose chairs, tables, sofas, airplanes, and motorcycles to cover a diversity of object shapes, and limited each class to five models to reduce training time. In addition, all CAD models lacking material and color information were excluded.
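That dataset-preparation step amounts to a simple filter over model records. In the sketch below, `has_material` and `has_color` are assumed metadata flags for illustration, not actual ShapeNetCore fields:

```python
# Keep only CAD models that carry both material and color information.
models = [
    {"id": "chair_001", "has_material": True,  "has_color": True},
    {"id": "table_007", "has_material": False, "has_color": True},
    {"id": "sofa_042",  "has_material": True,  "has_color": False},
]
usable = [m for m in models if m["has_material"] and m["has_color"]]
print([m["id"] for m in usable])  # ['chair_001']
```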


"Our model first learns the basic structure of a low-resolution object, and then gradually builds up high-level details," Beksi explains. "For example, it learns the relationship between an object's parts and its color: the legs of a chair or table are one color while the seat or top is a very different color. We build hierarchies for complete synthetic scene generation, which will be very useful for robotics."
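The coarse-to-fine idea Beksi describes can be pictured with a toy refinement step: start from a low-resolution cloud and spawn several jittered child points around each parent, so structure appears first and detail accumulates. This is only an illustrative sketch; in PCGAN the refinement is learned by the network rather than random:

```python
import numpy as np

rng = np.random.default_rng(0)

def refine(points, factor=4, noise=0.05):
    # points: (n, 3) -> (n * factor, 3): each parent point spawns
    # `factor` children, jittered by Gaussian noise.
    children = np.repeat(points, factor, axis=0)
    return children + rng.normal(0.0, noise, children.shape)

coarse = rng.uniform(-1, 1, (64, 3))  # low-resolution object
fine = refine(refine(coarse))         # two refinement levels
print(fine.shape)  # (1024, 3)
```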


They generated 5,000 random samples per class and evaluated the geometry and color of the resulting point clouds using a variety of metrics commonly used in the field. The results show that PCGAN can synthesize high-quality point clouds across the different object classes.
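One widely used geometry metric of this kind is the Chamfer distance: for every point in each cloud, find its nearest neighbor in the other cloud and average the squared distances. The brute-force O(n·m) version below is an illustrative sketch, not the paper's exact evaluation code:

```python
import numpy as np

def chamfer_distance(p, q):
    # p: (n, 3) and q: (m, 3) arrays of XYZ coordinates.
    # Pairwise squared distances, shape (n, m).
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    # Nearest-neighbor distance in each direction, averaged.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(chamfer_distance(a, a))        # identical clouds -> 0.0
print(chamfer_distance(a, a + 0.1))  # shifted copy -> small positive value
```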


One small step


While PCGAN does outperform some traditional sample-generation methods, as Beksi says, "This research is just a small step toward our ultimate goal of generating indoor panoramas that are realistic enough to improve robot perception."


In addition, Beksi is working on another problem, Sim2real, which focuses on how to quantify subtle differences and make simulated images more realistic by capturing the physical properties of a scene (friction, collisions, gravity) and using ray or photon tracing.


"Increasing the resolution to include more points and detail comes at the cost of increased computation," he said. Beyond the computational requirements, Beksi also needed a great deal of storage to conduct the study: the team generates hundreds of megabytes of data per second, and each point cloud contains roughly one million points, so the training datasets are huge and demand a lot of storage space.
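A quick back-of-envelope calculation shows why clouds of that size add up. Assuming each point stores XYZ as 32-bit floats plus RGB as three bytes (a typical uncompressed layout, assumed here for illustration):

```python
points_per_cloud = 1_000_000          # ~1 million points per cloud, as above
bytes_per_point = 3 * 4 + 3           # 12 bytes XYZ (float32) + 3 bytes RGB
cloud_mb = points_per_cloud * bytes_per_point / 1e6
print(f"{cloud_mb:.0f} MB per cloud")  # 15 MB per cloud
```

At hundreds of megabytes generated per second, even uncompressed clouds of ~15 MB each pile up fast, which is why storage became a bottleneck.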


Next, Beksi's team wants to deploy the software on a robot and see where it falls short relative to the real domain. Of course, while there is still a long way to go before we have truly powerful robots that can operate autonomously for long periods, the researchers' work will certainly benefit a number of fields, such as healthcare, manufacturing, and agriculture.
