

Topic

  • I learned that 3D human pose forecasting (Human Pose Forecasting) differs substantially from trajectory prediction (predicting the path a person will travel).
  • Below is a related report, generated with Google Gemini's Deep Search feature.

Analysis Report on Recent Comparative Research Trends in Human Pose Forecasting

I. Introduction

Human pose forecasting (Human Pose Forecasting/Prediction) is a core research topic in computer vision and graphics: given an observed sequence of past human motion, it predicts future human poses or motion. [1]

Humans instinctively anticipate the movements of others, which lets them move naturally through complex environments and avoid potential hazards. Giving machines the same ability plays an important role in many applications. [1] Its importance is especially prominent in robotics, autonomous driving, human-computer interaction (HCI), virtual/augmented reality (VR/AR), sports analytics, and medicine and healthcare. [2]

Early work focused mainly on predicting the motion of a single person over a short time span (roughly within 1 second) [1], whereas recent work is expanding toward more realistic and complex scenarios. These include long-term prediction, which extends the prediction horizon to several seconds or more [1]; multi-agent prediction, which accounts for interactions among several people [1]; probabilistic prediction, which models the uncertainty of the forecast [2]; and personalized prediction, which adapts to the characteristic movement patterns of a specific individual [3].

This shift drives the development of new methods that enable more accurate and realistic forecasts, and at the same time increases the need to compare and evaluate existing methods objectively.

This report aims to review and analyze research trends in human pose forecasting in depth, centered on recent work published between 2020 and 2025, and in particular on papers that directly compare, analyze, and evaluate different forecasting methods. To that end, it surveys the main forecasting models, evaluation datasets and metrics, comparative experimental results, and the strengths and limitations of the research, and then proposes directions for future work.

II. Overview of Human Pose Forecasting Methods

Research on human pose forecasting has developed on top of a variety of deep learning architectures. Each family of methods takes its own approach to capturing the spatio-temporal characteristics of human motion and predicting the future.

  • Recurrent Neural Networks (RNNs):
    • RNN-based models such as LSTM (Long Short-Term Memory) are strong at processing sequential data and were widely used in early human pose forecasting research. [2]
    • These models encode the temporal context of past motion to predict the pose in the next frame.
    • However, they struggle to learn dependencies over long sequences and tend to accumulate error, which limits them for long-term prediction. [4]
  • Graph Convolutional Networks (GCNs):
    • GCNs were introduced to treat the human skeleton as a graph and explicitly model the spatial relationships between joints. [2]
    • GCNs can effectively capture interactions between body parts, and models that combine them with RNNs to learn spatio-temporal features jointly (e.g., DMST-GRNN [4]) have also been proposed.
  • Transformers:
    • Transformers, which succeeded in natural language processing, are attracting attention in human pose forecasting as well because their attention mechanism can effectively model long-range dependencies within a sequence. [4]
    • Transformers learn temporal and spatial relationships simultaneously and have the potential for strong performance in long-term prediction in particular.
    • Models such as MotionBERT [5] use a transformer-based encoder pre-trained on large-scale datasets and have achieved strong performance on a variety of downstream tasks.
  • Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs):
    • Generative models such as VAEs and GANs are used to model the uncertainty and multi-modal nature of future motion. [2]
    • Instead of a single prediction, these models learn a distribution over several possible future motions and can therefore generate more realistic and diverse forecasts.
    • For example, Parsaeifard et al. proposed a generative approach to local pose dynamics using a VAE. [2]
  • Diffusion Models:
    • Diffusion models, which recently achieved state-of-the-art performance in image and video generation, are being actively applied to human motion prediction and generation as well. [6]
    • Diffusion models are noted for their ability to learn complex data distributions effectively and to generate diverse, high-quality samples.
    • Representative examples include MDM [7], PhysDiff [8], and AAMDM [9].
  • Hybrid & Decoupled Models:
    • Approaches that separate global trajectory prediction from local pose prediction have also been proposed. [1]
    • This has proven effective at managing complexity and improving performance, especially in long-term prediction and multi-agent scenarios.
    • The T2P model [10] reported good results using this approach.
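
The decoupling idea in the last item can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions, not the T2P or Parsaeifard et al. implementation: it assumes each frame is a list of 3D joints in world coordinates, treats joint 0 as the root (a hypothetical convention), and uses the made-up helper names `decouple` and `recombine`.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]
Pose = List[Joint]       # one frame: a list of 3D joints
Sequence = List[Pose]    # a motion clip: a list of frames


def decouple(seq: Sequence, root: int = 0) -> Tuple[List[Joint], Sequence]:
    """Split global poses into a root trajectory and root-relative local poses."""
    # Assumption: the joint at index `root` (e.g. the pelvis) defines the global position.
    trajectory = [frame[root] for frame in seq]
    local = [
        [(x - rx, y - ry, z - rz) for (x, y, z) in frame]
        for frame, (rx, ry, rz) in zip(seq, trajectory)
    ]
    return trajectory, local


def recombine(trajectory: List[Joint], local: Sequence) -> Sequence:
    """Add the root position back onto each local pose."""
    return [
        [(x + rx, y + ry, z + rz) for (x, y, z) in frame]
        for frame, (rx, ry, rz) in zip(local, trajectory)
    ]


# Two frames of a toy 2-joint "skeleton" that translates along x without changing shape.
clip = [
    [(0.0, 0.0, 1.0), (0.0, 0.5, 1.5)],
    [(1.0, 0.0, 1.0), (1.0, 0.5, 1.5)],
]
traj, local = decouple(clip)
```

In a decoupled pipeline, one model would forecast `traj` and another would forecast `local`; `recombine` then places the predicted poses back in the global frame. In this toy clip the local poses are identical across frames, so all of the motion lives in the trajectory.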

Each of these methods has its own strengths and weaknesses, and they are selected or combined according to the characteristics of the motion to be predicted (short-term/long-term, single/multi-agent, deterministic/probabilistic) and the requirements of the application.
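
To illustrate what "modeling the skeleton as a graph" means in the GCN item above, here is a minimal sketch of a single graph-convolution propagation step. It is not taken from any of the cited models: the 5-joint skeleton, the joint names, and the function names are hypothetical, the symmetric normalization follows the commonly used form D^-1/2 (A + I) D^-1/2, and the learned weight matrix is omitted so that only the neighbor mixing is visible.

```python
import math
from typing import List, Tuple

# Hypothetical 5-joint skeleton: hip(0)-spine(1)-neck(2)-head(3), plus hand(4) attached to the neck.
EDGES = [(0, 1), (1, 2), (2, 3), (2, 4)]
NUM_JOINTS = 5


def normalized_adjacency(edges: List[Tuple[int, int]], n: int) -> List[List[float]]:
    """Return D^-1/2 (A + I) D^-1/2 for an undirected skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # self-loops
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def graph_conv(a_hat: List[List[float]], feats: List[List[float]]) -> List[List[float]]:
    """One propagation step: each joint mixes its features with those of its neighbors."""
    n, d = len(feats), len(feats[0])
    return [
        [sum(a_hat[i][k] * feats[k][c] for k in range(n)) for c in range(d)]
        for i in range(n)
    ]


a_hat = normalized_adjacency(EDGES, NUM_JOINTS)
# One scalar feature per joint; only the head (joint 3) carries a signal.
features = [[0.0], [0.0], [0.0], [1.0], [0.0]]
out = graph_conv(a_hat, features)
```

After one step the signal has reached only the head itself and its direct neighbor, the neck; the hip, spine, and hand stay at zero. Stacking several such layers is what lets information travel along the kinematic chain.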

III. Benchmark Datasets and Evaluation Metrics

Standardized benchmark datasets and appropriate evaluation metrics are essential for evaluating and comparing human pose forecasting models objectively.

A. Main Benchmark Datasets

A variety of datasets are used in human pose forecasting research, and each is characterized by its capture environment, number of people, motion types, and annotation accuracy.

  • Human3.6M (H3.6M) [3]:
    • One of the most widely used indoor datasets for 3D human pose forecasting and related tasks. [3]
    • Provides 3.6 million accurate 3D pose annotations captured with a marker-based motion capture system.
    • Mostly contains a single person performing varied everyday activities, and is used mainly for short-term prediction benchmarks (e.g., observe the past 0.4 seconds, then predict the next 1 second). [3]
    • Limitations: controlled environment, limited motion diversity, focus on average motion. [3]
  • CMU Motion Capture (CMU MoCap) [3]:
    • A large-scale motion capture dataset containing many types of human motion (e.g., walking, running, sports). [3]
    • Frequently used together with H3.6M to evaluate short-term and long-term prediction performance. [4]
  • HumanEva [3]:
    • Captured indoors like H3.6M; provides 3D motion capture data synchronized with video. [3]
  • AMASS (Archive of Motion Capture as Surface Shapes) [5]:
    • A large-scale dataset that unifies several motion capture datasets and provides them as SMPL body model parameters. [5]
    • Contains varied motions and body shapes, which makes it useful for evaluating generalization.
    • Used for MotionBERT pre-training [5] and for evaluating physics-based prediction models [11], among others.
  • 3DPW (3D Poses in the Wild) [1]:
    • Video sequences filmed in real outdoor environments, with 3D pose information obtained using IMU sensors. [1]
    • Used to evaluate prediction performance in 'in-the-wild' settings.
    • Limitation: limited multi-agent interaction (at most 2 people). [1]
  • MuPoTS-3D (Multi-Person Pose Test Set in 3D) [1]:
    • Captures the 3D poses of several people (up to 20) in real environments using a multi-view markerless motion capture system. [1]
    • Includes realistic difficulties such as occlusion and lighting changes. [12]
  • JRDB-GMP (JRDB-GlobMultiPose) [1]:
    • A real-world dataset built on the JRDB dataset [1] for long-term (up to 5 seconds), multi-agent (up to 24 people) human pose forecasting. [1]
    • Aims to overcome the limitations of existing datasets and to evaluate prediction performance in realistic interaction scenarios. [1]
  • THÖR [13]:
    • Captured in an indoor environment that includes interaction with a robot, using a high-precision motion capture system (Qualisys). [13]
    • Uses dynamic task assignment to produce varied interaction situations (e.g., overtaking, stopping, obstructing), in an attempt to overcome the monotony of existing datasets. [13]
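
The short-term protocol mentioned for Human3.6M (observe 0.4 s, predict 1 s) turns into concrete frame counts once a frame rate is fixed. The sketch below assumes 25 frames per second, which is an illustrative assumption rather than a value stated in this report; the helper names `seconds_to_frames` and `make_windows` and the stride of 5 are likewise hypothetical.

```python
from typing import List, Tuple


def seconds_to_frames(seconds: float, fps: int) -> int:
    """Convert a duration to a whole number of frames."""
    return round(seconds * fps)


def make_windows(
    num_frames: int, obs_len: int, pred_len: int, stride: int = 1
) -> List[Tuple[range, range]]:
    """Slide over a clip and collect (observed frames, future frames) index ranges."""
    windows = []
    total = obs_len + pred_len
    for start in range(0, num_frames - total + 1, stride):
        observed = range(start, start + obs_len)
        future = range(start + obs_len, start + total)
        windows.append((observed, future))
    return windows


FPS = 25                             # assumed frame rate
obs = seconds_to_frames(0.4, FPS)    # 10 observed frames
pred = seconds_to_frames(1.0, FPS)   # 25 frames to predict
windows = make_windows(num_frames=100, obs_len=obs, pred_len=pred, stride=5)
```

Each window pairs the frames a model is allowed to see with the frames it must forecast. A long-term setting such as the 5-second horizon of JRDB-GMP would only change `pred_len`, although the clip must then be long enough to contain at least one full window.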

Established datasets such as H3.6M and CMU MoCap were captured in controlled environments, so they lack motion diversity and focus mainly on average motion over short time horizons. [1] They therefore do not adequately reflect the requirements of real application scenarios such as robotics and HCI, which involve long durations, interaction with many people, and personalized prediction. [1]

In recognition of these limitations, efforts are under way to build new datasets that contain more realistic and challenging scenarios, such as JRDB-GMP [1] and THÖR [13]. This is an important step toward evaluating how applicable models are in practice and how well they generalize.

Table 1: Overview of the main benchmark datasets for human pose forecasting

| Dataset | Type | Key characteristics | Prediction horizon | Main limitations | Source |
|---|---|---|---|---|---|
| Human3.6M | Indoor, marker-based MoCap | 3.6M 3D poses, mostly single person, varied activities | Mostly short-term (~1 s) | Controlled environment, limited motion diversity, focus on average motion | [3] |
| CMU MoCap | Indoor, marker-based MoCap | Large scale, varied motion types (everyday, sports, etc.) | Short-term/long-term | Controlled environment | [3] |
| HumanEva | Indoor, marker-based MoCap | 3D MoCap data synchronized with video | Mostly short-term | Controlled environment | [3] |
| AMASS | Unified MoCap data | SMPL parameters, varied motions and body shapes | Varied | Derived from existing MoCap data (not captured directly) | [5] |
| 3DPW | Outdoor, video + IMU | Real outdoor environments, 'in-the-wild' | Mostly short-term | At most 2 people, possible accuracy issues with IMU-based poses | [1] |
| MuPoTS-3D | Indoor/outdoor, markerless MoCap | Multi-view, multi-person (up to 20), includes occlusion and lighting changes | Mostly short-term | Possible accuracy issues with markerless poses | [1] |
| JRDB-GMP | Real-world, video-based | Long-term (up to 5 s), multi-agent (up to 24 people), real interactions | Long-term | New dataset, needs standardization and validation | [1] |
| THÖR | Indoor, high-precision MoCap | Environment with a robot, dynamic task assignment, attempts to produce varied interactions (overtaking, stopping, etc.) | Varied | Specific environment (Örebro University), data scale needs to grow | [13] |
| LaFAN1 | Indoor, MoCap | Developed by Ubisoft for game animation, includes interactions | Varied | Possible bias toward the game/animation domain | [9] |
| KIT-ML | Indoor, MoCap | Paired text-motion dataset | Varied | Based on text annotations, focused on language-to-motion mapping | [8] |
| HumanAct12 | Based on MoCap data | 12 action categories | Varied | Based on action classes, possible bias toward specific actions | [8] |
| UESTC | Indoor, MoCap | 40 action classes, 40 subjects | Varied | Based on action classes, possible bias toward specific actions | [8] |
| HumanML3D | MoCap data + text | Built on AMASS/HumanAct12 with reworked text annotations | Varied | Possible issues with text annotation quality and consistency | [8] |

IV. Comparative Study Results and Performance Analysis

Recent comparative studies analyze the performance of different methods using several benchmark datasets and evaluation metrics. This makes it possible to identify the strengths and weaknesses of each method and the conditions under which it excels.

A. Performance Analysis by Method

  • RNNs/LSTMs:
    • Often used as baselines in comparative studies.
    • They perform reasonably in short-term prediction, but in long-term prediction their performance tends to degrade because of error accumulation and the difficulty of learning long-range dependencies. [4]
    • The DMST-GRNN model, which combines an RNN with a GCN, showed improved mean angle error (MAE) over a plain RNN in both short-term and long-term prediction on the H3.6M and CMU MoCap datasets. [4]
  • GCNs:
    • By modeling the skeletal structure explicitly, they capture spatial relationships well and are strong in prediction tasks where structural information matters. [4]
    • However, they may be weaker than transformers at capturing purely temporal long-range dependencies.
    • The fact that the transformer-based MotionBERT outperformed GCN-based models (ST-GCN, 2s-AGCN) on action recognition [5] suggests that transformers can learn spatio-temporal features more effectively.
  • Transformers:
    • Thanks to their ability to model long-range dependencies effectively, they achieve state-of-the-art results in related areas such as pose estimation and motion generation [8], which indicates high potential for pose forecasting as well.
    • MotionBERT achieved SOTA in 3D pose estimation (measured by MPJPE) on the H3.6M dataset [5], and MDM showed SOTA performance on text- and action-conditioned motion generation metrics (FID, R-Precision, Diversity, etc.) across datasets such as HumanML3D, KIT, HumanAct12, and UESTC. [7]
    • This shows that transformers are effective at learning complex spatio-temporal patterns. [4]
  • Diffusion Models:
    • Comparative studies focus mainly on motion 'generation', but their results offer insight into 'prediction' performance.
    • Diffusion models show SOTA performance in generation quality and diversity. [8]
    • In particular, PhysDiff greatly improved physical plausibility over earlier diffusion models such as MDM and MotionDiffuse while keeping FID and relevance scores competitive or better. [8]
    • AAMDM addressed the slow sampling speed of diffusion models, achieving quality and diversity similar to AMDM200 at a much higher FPS. [9]
    • This suggests that diffusion-based prediction can offer high fidelity and diversity, but that efficiency and controllability still need research.
  • Decoupled/Hierarchical Models:
    • Models that separate the global trajectory from the local pose report strong performance, especially in long-term, multi-agent scenarios. [1]
    • The T2P model claimed SOTA performance with this approach on JRDB-GMP and earlier datasets [10], and the VAE-based decoupled model of Parsaeifard et al. also claimed superiority over baseline models. [2]
    • This shows that decomposing a complex problem can be an effective way to handle it.
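
MPJPE (Mean Per Joint Position Error), the metric cited above, is the Euclidean distance between each predicted joint and its ground-truth position, averaged over joints and frames. The following is a minimal sketch with hypothetical function names and made-up toy values; real benchmarks add protocol details such as root alignment and fixed evaluation horizons that are not reproduced here.

```python
import math
from typing import List, Tuple

Joint = Tuple[float, float, float]
Pose = List[Joint]


def mpjpe(pred: List[Pose], target: List[Pose]) -> float:
    """Mean Euclidean distance between matching joints, averaged over frames and joints."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += math.dist(p, t)
            count += 1
    return total / count


def mpjpe_per_frame(pred: List[Pose], target: List[Pose]) -> List[float]:
    """Per-frame error, useful for seeing how error grows with the prediction horizon."""
    return [mpjpe([p], [t]) for p, t in zip(pred, target)]


# Toy data: 2 frames x 2 joints, units assumed to be millimetres.
target = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
]
pred = [
    [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],   # one joint is off by 5
    [(3.0, 4.0, 0.0), (3.0, 4.0, 0.0)],   # both joints are off by 5
]
overall = mpjpe(pred, target)              # 3.75
by_frame = mpjpe_per_frame(pred, target)   # [2.5, 5.0]
```

The per-frame view shows the pattern described for RNN baselines: the error in the later frame is larger than in the earlier one, which a single averaged number hides.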

B. Strengths and Weaknesses Identified in the Comparative Literature

  • RNNs:
    • Strengths: easy to implement, well suited to short sequences.
    • Weaknesses: vanishing gradients, error accumulation, weak long-range dependency modeling. [4]
  • GCNs:
    • Strengths: explicit modeling of the skeletal structure.
    • Weaknesses: may be weaker than transformers at learning purely temporal long-range dependencies.
  • Transformers:
    • Strengths: excellent long-range dependency modeling, parallelizable.
    • Weaknesses: high computational cost, may require large-scale data, weaker built-in structural bias than GCNs.
  • VAEs/GANs:
    • Strengths: modeling of uncertainty and multi-modality.
    • Weaknesses: training instability (GANs), limited expressiveness compared with diffusion models, possible mode collapse.
  • Diffusion Models:
    • Strengths: state-of-the-art generation quality and diversity, flexible conditioning.
    • Weaknesses: slow sampling (being improved [14]), need for a separate mechanism to ensure physical plausibility [8], relatively few direct comparative studies on prediction.
  • Deterministic Models:
    • Strengths: easy to train and evaluate (using MPJPE).
    • Weaknesses: fail to capture future uncertainty, tend to produce overly smooth or averaged predictions. [4]
  • Stochastic Models:
    • Strengths: model diverse futures, which makes them more realistic.
    • Weaknesses: hard to evaluate (distribution-level metrics are needed), can be hard to control.
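
The evaluation difficulty noted for stochastic models comes from having several sampled futures per input instead of one. One common way to score them, shown here as a sketch rather than as the protocol of any cited paper, is to report the error of the best of K samples (accuracy) together with the average pairwise distance between samples (diversity). The function names and toy values below are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float, float]
Motion = List[Point]  # one sampled future, flattened to a list of 3D joint positions


def mean_joint_distance(a: Motion, b: Motion) -> float:
    """MPJPE-style distance between two motions of equal length."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def best_of_k(samples: List[Motion], target: Motion) -> float:
    """Accuracy: error of the sample closest to the ground truth."""
    return min(mean_joint_distance(s, target) for s in samples)


def average_pairwise_distance(samples: List[Motion]) -> float:
    """Diversity: mean distance over all pairs of samples."""
    pairs = list(combinations(samples, 2))
    return sum(mean_joint_distance(a, b) for a, b in pairs) / len(pairs)


# Three sampled futures for the same observed past (toy values).
samples = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 2.0, 0.0), (0.0, 2.0, 0.0)],
]
target = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
accuracy = best_of_k(samples, target)
diversity = average_pairwise_distance(samples)
```

The two numbers pull in different directions: a model can lower its best-of-K error simply by spreading samples widely, so neither value is meaningful without the other. This is one reason distribution-level metrics are harder to interpret than a single MPJPE.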

C. State-of-the-Art Highlights from Recent Comparative Studies

  • Short-term prediction (H3.6M/CMU): GCN-based models (e.g., DMST-GRNN [4]) and transformer-based models generally outperform earlier RNN approaches. The specific SOTA MPJPE values depend on the exact time range and evaluation protocol.
  • Long-term prediction (H3.6M/CMU/JRDB-GMP): Models that incorporate motion context [4], interaction awareness [1], goal conditioning [1], or decoupling [1] tend to perform better. The T2P model claimed SOTA performance on JRDB-GMP, a long-term, multi-agent dataset. [10]
  • Generation quality/diversity (HumanML3D/KIT): Diffusion models such as MDM [7] and PhysDiff [8] showed SOTA-level FID, diversity, and multi-modality scores on text- and action-conditioned 'generation' tasks. This suggests strong potential for high-quality probabilistic 'prediction'.
  • Physical plausibility: PhysDiff [8] reduced physical errors (foot sliding, ground penetration, floating) by 78% to 94% relative to baseline diffusion models (MDM, MotionDiffuse) on several datasets (HumanML3D, HumanAct12, UESTC).
  • Efficiency: On the LaFAN1 dataset, AAMDM [9] ran about 40 times faster (173 FPS) than a standard autoregressive diffusion model (AMDM200) while maintaining quality and diversity, which points to the feasibility of real-time interaction. EMDM [14] also targets real-time generation.
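
The physical errors listed above (ground penetration, foot sliding) can be measured directly from a foot joint's trajectory. The sketch below uses simplified definitions chosen for illustration; they are not the exact metrics from the PhysDiff paper. It assumes a z-up coordinate system with the ground at z = 0, units in metres, and a hypothetical contact threshold of 0.05.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z), z up, ground plane at z = 0 (assumed)


def ground_penetration(foot: List[Point]) -> float:
    """Mean depth below the ground plane (0 when the foot never goes under it)."""
    return sum(max(0.0, -z) for _, _, z in foot) / len(foot)


def foot_sliding(foot: List[Point], contact_height: float = 0.05) -> float:
    """Mean horizontal displacement between consecutive frames while the foot is in contact."""
    slides = []
    for (x0, y0, z0), (x1, y1, z1) in zip(foot, foot[1:]):
        # Assumption: a foot counts as "in contact" when it is at or below the threshold.
        if z0 <= contact_height and z1 <= contact_height:
            slides.append(math.hypot(x1 - x0, y1 - y0))
    return sum(slides) / len(slides) if slides else 0.0


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which an error metric dropped relative to a baseline."""
    return (baseline - improved) / baseline


# Toy foot track: slides on the ground, dips below it, then lifts off.
foot_track = [
    (0.00, 0.00, 0.00),
    (0.03, 0.04, 0.00),
    (0.03, 0.04, -0.02),
    (0.50, 0.04, 0.30),
]
penetration = ground_penetration(foot_track)
sliding = foot_sliding(foot_track)
```

A reported reduction such as "78% to 94%" corresponds to `relative_reduction` values between 0.78 and 0.94 when a metric of this kind is computed for the baseline and for the physics-guided model.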

Taken together, these comparisons show that although transformers are powerful at capturing long-range dependencies [4], hybrid approaches that integrate domain knowledge often deliver the best performance, especially in complex long-term prediction and interaction scenarios. Examples include exploiting structural information through GCNs [4], explicit global/local decoupling [1], and physics guidance [8].

This suggests that combining a highly expressive architecture with explicit structural or physical constraints is important for handling the subtle aspects of human motion prediction. In addition, the clear trade-off in diffusion models between prediction quality/diversity and efficiency [14] highlights the need for research on acceleration techniques for real-time prediction. Finally, physical plausibility, one of the main failure modes of purely data-driven generative models [8], can be improved greatly by explicit solutions such as PhysDiff [8] without much damage to other metrics, which implies that physics-aware modeling should become a standard consideration.

Table 3: Summary of comparative performance by model family on the main benchmarks

| Model family | Example models/papers | Strengths (from comparisons) | Weaknesses (from comparisons) | Benchmark/task performance summary (examples) | Source |
|---|---|---|---|---|---|
| RNN-based | LSTM [2], DMST-GRNN (GCN+RNN) [4] | Simplicity, handling of short sequences | Weak long-term dependencies, error accumulation | DMST-GRNN: improved short-term/long-term MAE on H3.6M/CMU (vs. RNN) | [2] |
| GCN-based | ST-GCN, 2s-AGCN [5], DMST-GRNN [4] | Explicit modeling of the skeletal structure | May be weak at purely temporal long-range dependencies | DMST-GRNN: SOTA MAE on H3.6M/CMU (at the time) | [2] |
| Transformer-based | MotionBERT [5], MDM [7], T2P [10] | Excellent long-range dependency modeling | High computational cost, lack of structural bias | MotionBERT: SOTA MPJPE for 3D estimation on H3.6M. MDM: SOTA generation FID/Diversity on HumanML3D/KIT/HumanAct12/UESTC. T2P: claims SOTA prediction on JRDB-GMP. | [4] |
| VAE/GAN-based | Parsaeifard et al. [2] | Modeling of uncertainty and multi-modality | Training instability, possibly limited expressiveness | Parsaeifard: claims the decoupled model beats the baselines | [2] |
| Diffusion-based | MDM [7], MotionDiffuse [8], PhysDiff [8], AAMDM [9] | SOTA generation quality/diversity, flexible conditioning | Slow sampling, difficulty ensuring physical plausibility | PhysDiff: large reduction in physical errors (vs. MDM/MotionDiffuse). AAMDM: ~40x faster than AMDM200 (FPS). | [6] |
| Decoupled/Hierarchical | T2P [10], Parsaeifard et al. [2] | Easier complexity management (especially long-term/multi-agent) | Possible information loss in the decoupling step | T2P: claims SOTA on JRDB-GMP and earlier datasets | [1] |

V. Main Research Trends and Persistent Challenges

Human pose forecasting is evolving quickly, driven by growing application requirements and advances in deep learning. Alongside several major research trends, there are still challenges that remain to be solved.

A. Main Trends Shaping the Field

  • Long-Term Prediction: Efforts to extend the prediction horizon from under 1 second to several seconds or more. [1] This requires handling uncertainty and higher-level reasoning about plans and intent.
  • Multi-Agent Interaction: Modeling and predicting motion in situations where several people interact at the same time. [1] This is essential for realistic scene understanding and requires suitable datasets [1] and interaction-aware models.
  • Probabilistic/Diverse Forecasting: Generating several possible futures that reflect the inherent uncertainty of the future, instead of a single deterministic prediction. [1] This trend is driven by generative models such as VAEs, GANs, and diffusion models.
  • Personalization: Research on adapting prediction models to an individual's characteristic movement style, body proportions, and behavioral traits, especially in long-duration HCI scenarios. [3] This requires online adaptation or per-person model training.
  • Scene/Context/Physics Awareness: Attempts to generate more realistic, environment-appropriate predictions by integrating 3D environment information [15], interaction with objects [16], or physical laws [8].
  • Conditioned Prediction: Generating or predicting motion according to a variety of input conditions such as text [8], action [8], trajectory [1], and image [17]. The boundary with conditional generation is blurry, but the topic is closely related to controllable prediction.
  • Improved Architectures: Continued exploration of GCNs, transformers, and diffusion models, and work on combining their strengths or integrating domain knowledge (e.g., decoupling, physics). Pre-training on large-scale datasets. [5]

These major trends (long-term, multi-agent, probabilistic, context-aware) are closely connected and are moving toward a shared goal: enabling more realistic and useful predictions in complex, interactive environments.

For example, long-term prediction inevitably requires an understanding of context and interaction, while real interactions involve several agents and an uncertain future, which calls for a probabilistic approach. Each trend is therefore not an isolated development but one facet of a broader goal of moving beyond simple kinematic extrapolation in controlled environments.

B. Main Obstacles Highlighted in Comparative Reviews

  • Data Quality, Quantity, and Bias: There is a need for larger, more diverse, and accurately annotated datasets, especially for multi-agent, long-term, and real-world scenarios. [1] Existing datasets have limitations [1], and the accuracy of the ground truth can also be an issue. [13]
  • Evaluation Rigor: Over-reliance on simple metrics such as MPJPE. [18] There is a need for a comprehensive evaluation framework covering accuracy, diversity, plausibility, relevance, and efficiency. [12] Benchmarking protocols need to advance. [3]
  • Generalization: Models trained on specific datasets (mostly motion capture) may not generalize well to diverse real-world ('in-the-wild') scenarios. This is the domain gap problem. [19]
  • Physical Plausibility: Ensuring that predictions obey physical laws and avoid artifacts remains a difficult challenge, especially for generative models and long-term prediction. [8]
  • Controllability: It is difficult to control generative models effectively so that they produce a specific desired motion (related to conditioned prediction). [7]
  • Computational Cost / Real-time Constraints: Complex models such as transformers, and diffusion models in particular, have high computational cost, which can hinder real-time applications. [7]
  • Handling Occlusion and Noise: Real input data (pose estimation results) is often noisy or incomplete because of occlusion. [6] Prediction models must be robust to such imperfections.
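
As a small illustration of the occlusion problem in the last item, the sketch below fills gaps in a single joint's track before it is passed to a forecasting model. Linear interpolation is a deliberately simple baseline chosen for this example, not a method taken from the cited papers; the function name `fill_missing` and the use of `None` to mark occluded frames are assumptions.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def fill_missing(track: List[Optional[Point]]) -> List[Point]:
    """Fill occluded frames (None) of one joint by linear interpolation over time.

    Gaps at the start or end are filled by holding the nearest observed value.
    """
    known = [i for i, p in enumerate(track) if p is not None]
    if not known:
        raise ValueError("joint is never observed")
    filled: List[Point] = []
    for i, p in enumerate(track):
        if p is not None:
            filled.append(p)
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            filled.append(track[nxt])       # leading gap: copy first observation
        elif nxt is None:
            filled.append(track[prev])      # trailing gap: copy last observation
        else:
            w = (i - prev) / (nxt - prev)   # position of frame i inside the gap
            a, b = track[prev], track[nxt]
            filled.append(tuple(a[c] + w * (b[c] - a[c]) for c in range(3)))
    return filled


# The joint is visible only in frames 1 and 3.
observed = [None, (0.0, 0.0, 0.0), None, (2.0, 0.0, 4.0), None]
repaired = fill_missing(observed)
```

Interpolation of this kind only works for short gaps with smooth motion. Robust forecasting models are usually expected to tolerate missing or noisy joints themselves, which is why the report lists this as an open challenge.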

It is notable that a large share of the persistent challenges relate to data and evaluation. [1] This suggests that progress may be constrained not only by model architectures but also by the limitations of existing benchmarks and evaluation methods. If datasets lack diversity, or if evaluation metrics fail to capture important aspects such as plausibility or interaction quality, even sophisticated models may not be developed or evaluated effectively for real-world usefulness. The emphasis on developing new datasets [1] and metrics [8] reflects awareness of this problem.

The physical plausibility problem [8] also exposes a fundamental tension between flexible, data-driven generative models and the strict constraints of the physical world. Deep learning models are good at learning correlations from data but struggle to enforce strict constraints that are not expressed explicitly. Because a physics engine provides ground-truth information about physical laws, combining the two, for example through integrated simulation [8] or physics-informed losses and reinforcement learning [16], appears to be a necessary approach for generating truly realistic motion.

VI. Future Research Directions and Conclusion

Comparative studies in human pose forecasting shed light on the current state of the art and offer important insight into where future research should go.

A. Future Research Opportunities Synthesized from the Reviews

  • Building better benchmarks: Develop more diverse, larger-scale, and accurately annotated datasets for long-term, multi-agent, interactive, and real-world prediction. [1] Establish standardized evaluation protocols focused on holistic performance, covering accuracy, diversity, plausibility, efficiency, and task relevance. Develop dedicated benchmarks for personalized prediction. [3]
  • Improving long-term and interaction models: Develop architectures that better model long-range dependencies, goal-directed behavior, and complex multi-agent interactions. [1] Explore hierarchical models, memory mechanisms, and social interaction priors.
  • Efficient and controllable generative models: Research faster sampling methods for diffusion models [14] and improve control over generated output for conditioned prediction. [20] Explore alternatives such as flow-based models. [6]
  • Improved physical realism: Look for ways to integrate physical priors into models more deeply and efficiently, going beyond post-hoc correction or costly simulation steps. [8] Explore learning physics implicitly or using differentiable physics.
  • Personalization and adaptation: Develop methods that learn robust personalized models from limited data or adapt quickly to individual users online. [3]
  • Explainability and Trustworthiness: As models grow more complex, understanding why a particular prediction was made becomes important, especially in safety-critical applications. [21]
  • Cross-Modal Forecasting: Integrate modalities other than past poses, such as scene information (images [17], 3D scans [15]), audio, and text instructions, into the prediction process more effectively.

Notably, many of the proposed future directions directly address the persistent challenges identified earlier (e.g., better benchmarks for data limitations, physics integration for plausibility, and solutions to the efficiency problem of diffusion models). [1] This indicates that the research community is strongly aware of the current bottlenecks and is actively targeting known weaknesses. These are areas where substantial progress is likely within the next few years.

The growing interest in personalization [3] and explainability [21] also points to a future in which prediction models must not only be accurate on average but also be tailored to specific human-centered applications and be trustworthy. As AI systems interact more closely with humans, generic black-box models will become less acceptable, and demand will grow for models that understand individual nuances and whose behavior can be understood or anticipated. This is likely to push research beyond raw benchmark performance.

B. Overall Conclusion

Human pose forecasting research has advanced considerably, from short-term deterministic prediction toward long-term, multi-agent, and probabilistic scenarios. RNNs, GCNs, and especially transformers have improved modeling capability. Generative models, in particular diffusion models (borrowed from generation tasks), show promise for handling uncertainty and diversity but face challenges in efficiency and physical plausibility.

The core message of the comparative studies is that no single method dominates in every respect. Transformers provide strong sequence modeling, GCNs exploit structure, decoupling helps manage complexity, and physics guidance increases realism. Evaluation needs a multi-faceted approach that goes beyond MPJPE, and benchmark limitations remain an important bottleneck.

Future progress depends on developing better benchmarks, building more sophisticated models that integrate an understanding of context and interaction, researching efficient and controllable generation techniques, and developing robust methods that ensure physical plausibility and personalization. This is a dynamic field driven by demanding applications such as HCI, robotics, and autonomous systems, and the continued pursuit of realism, interaction, efficiency, and personalization is expected to lead its development.


References

  1. Jeong, H., Choi, J., & Lee, G. (2024). Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://openaccess.thecvf.com/content/CVPR2024/papers/Jeong_Multi-agent_Long-term_3D_Human_Pose_Forecasting_via_Interaction-aware_Trajectory_Conditioning_CVPR_2024_paper.pdfย ย 2ย 3ย 4ย 5ย 6ย 7ย 8ย 9ย 10ย 11ย 12ย 13ย 14ย 15ย 16ย 17ย 18ย 19ย 20ย 21ย 22ย 23ย 24ย 25ย 26ย 27ย 28ย 29ย 30ย 31ย 32ย 33ย 34ย 35ย 36ย 37ย 38ย 39

  2. Parsaeifard, B., & Stiefelhagen, R. (2021). Learning Decoupled Representations for Human Pose Forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. https://openaccess.thecvf.com/content/ICCV2021W/SoMoF/papers/Parsaeifard_Learning_Decoupled_Representations_for_Human_Pose_Forecasting_ICCVW_2021_paper.pdf

  3. Adeli, V., Shariat, N., Marin, R., Reid, I., & Salzmann, M. (2023). Personalized Pose Forecasting. arXiv preprint arXiv:2312.03528. https://arxiv.org/pdf/2312.03528

  4. Mao, W., Liu, M., & Salzmann, M. (2019). Long-Term Human Motion Prediction by Modeling Motion Context and Enhancing Motion Dynamics. ResearchGate. https://www.researchgate.net/publication/326206421_Long-Term_Human_Motion_Prediction_by_Modeling_Motion_Context_and_Enhancing_Motion_Dynamics

  5. Zheng, W., Liu, M., & Salzmann, M. (2023). MotionBERT: A Unified Perspective On Learning Human Motion Representations. Scribd. https://www.scribd.com/document/714836400/MotionBERT-A-Unified-Perspective-on-Learning-Human-Motion-Representations

  6. Xu, Z., Chai, J., & Lv, X. (2025). Human Motion Prediction, Reconstruction, and Generation. arXiv preprint arXiv:2502.15956. https://arxiv.org/html/2502.15956v1

  7. Tevet, G., et al. (2022). Human Motion Diffusion Model. OpenReview. https://openreview.net/pdf?id=SJ1kSyO2jwu

  8. Yuan, Y., Rempe, D., Liu, Z., Wang, T., Snavely, N., & Black, M. J. (2023). PhysDiff: Physics-Guided Human Motion Diffusion Model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). https://openaccess.thecvf.com/content/ICCV2023/papers/Yuan_PhysDiff_Physics-Guided_Human_Motion_Diffusion_Model_ICCV_2023_paper.pdf

  9. Li, T., et al. (2024). AAMDM: Accelerated Auto-regressive Motion Diffusion Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://openaccess.thecvf.com/content/CVPR2024/html/Li_AAMDM_Accelerated_Auto-regressive_Motion_Diffusion_Model_CVPR_2024_paper.html

  10. Paper on the T2P model (link needed)

  11. Towards Realistic Human Motion Prediction with Latent Diffusion and Physics-Based Models. (2025). MDPI. https://www.mdpi.com/2079-9292/14/3/605

  12. Review of models for estimating 3D human pose using deep learning. (2025). PMC. https://pmc.ncbi.nlm.nih.gov/articles/PMC11888865/

  13. Kjellström, H., et al. (2021). The THÖR dataset: A dataset for human-robot interaction. Örebro University. http://oru.diva-portal.org/smash/get/diva2:1524236/FULLTEXT01.pdf

  14. EMDM: Efficient Motion Diffusion Model for Fast and High-Quality Motion Generation. (2023). arXiv preprint arXiv:2312.02256. https://arxiv.org/html/2312.02256v3

  15. Harmonizing Stochasticity and Determinism: Scene-responsive Diverse Human Motion Prediction. (2024). OpenReview. https://openreview.net/forum?id=NQCkNM6TES

  16. ReinDiffuse: Crafting Physically Plausible Motions with Reinforced Diffusion Model. (2024). arXiv preprint arXiv:2410.07296. https://arxiv.org/html/2410.07296v1

  17. Move-in-2D: 2D-Conditioned Human Motion Generation. (2024). arXiv preprint arXiv:2412.13185. https://arxiv.org/html/2412.13185v1

  18. Martinez, J., Black, M. J., & Romero, J. (2017). On Human Motion Prediction Using Recurrent Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://typeset.io/papers/on-human-motion-prediction-using-recurrent-neural-networks-2i62b0kvs1

  19. A Survey on Deep Learning-Based 2D Human Pose Estimation Models. (2023). Tech Science Press. https://www.techscience.com/cmc/v76n2/53975/html

  20. Enhanced Fine-grained Motion Diffusion for Text-driven Human Motion Synthesis. (2023). arXiv preprint arXiv:2305.13773. https://arxiv.org/html/2305.13773v2

  21. Neuro-Symbolic AI in 2024: A Systematic Review. (2025). arXiv preprint arXiv:2501.05435. https://arxiv.org/html/2501.05435v1