Georgia Tech researchers recently presented their work at leading programming and systems conferences, focusing on static ...
SOLE is highly generalizable and can segment corresponding instances with various language instructions, including but not limited to visual questions, attribute descriptions, and functional ...
Abstract: It is widely believed that pre-trained vision-language foundation models (e.g., CLIP) would substantially facilitate vision-language tasks. Nevertheless, there has been less evidence in ...
Abstract: Given the widespread adoption of depth-sensing acquisition devices, RGB-D videos and related data/media have gained considerable traction in various aspects of daily life. Consequently, ...