VASTile: Viewport Adaptive Scalable 360-Degree Video Frame Tiling Chamara Madarasingha The University of Sydney Sydney, Australia ckat9988@uni.sydney.edu.au Kanchana Thilakarathna The University of Sydney Sydney, Australia kanchana.thilakarathna@sydney.edu.au ABSTRACT 360° videos, a.k.a. spherical videos, are becoming popular among users; nevertheless, the omnidirectional view of these videos demands high bandwidth and processing power at the end devices. Recently proposed viewport-aware streaming mechanisms can reduce the amount of data transmitted by streaming a limited portion of the frame covering the current user viewport (VP). However, they still send a high amount of redundant data, as fixed tile mechanisms cannot provide a finer granularity to the user VP. Although making the tiles smaller can provide a finer granularity for the user viewport, it significantly increases encoding-decoding overhead. To overcome this trade-off, in this paper, we present a computational geometry based adaptive tiling mechanism named VASTile, which takes visual attention information on a 360° video frame as the input and provides a suitable non-overlapping variable size tile cover on the frame. Experimental results show that VASTile can reduce pixel redundancy before compression by up to 31.1% and save up to 35.4% of bandwidth compared to recently proposed fixed tile configurations, providing tile schemes within 0.98 (±0.11) seconds. CCS CONCEPTS • Human-centered computing → Virtual reality KEYWORDS 360-Degree/VR video; Viewport awareness; Frame tiling ACM Reference Format: Chamara Madarasingha and Kanchana Thilakarathna. 2021. VASTile: Viewport Adaptive Scalable 360-Degree Video Frame Tiling. In Proceedings of the 29th ACM International Conference on Multimedia (MM '21), October 20–24, 2021, Virtual Event, China. ACM, New York, NY, USA, 9 pages.
https://doi.org/10.1145/3474085.3475613 MM '21, October 20–24, 2021, Virtual Event, China © 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8651-7/21/10. . . $15.00 1 INTRODUCTION Over the past decade, video streaming has dominated global internet traffic, accounting for 80% of total traffic [21]. Among them, 360° videos, a.k.a. spherical videos, are becoming increasingly popular as they enable an immersive streaming experience. Along with the advancement of Head Mounted Displays (HMDs) (e.g., Facebook Oculus [6], Microsoft Hololens [14]) and smartphones, content providers (e.g., YouTube (YT) [32] and Facebook (FB) [7]) have already facilitated both on-demand and live-streaming of 360° videos. The intrinsic larger spherical view of 360° videos poses several challenges to providing the expected high user Quality of Experience (QoE). Firstly, 360° videos demand high network bandwidth, since spherical video frames should be 4–6 times larger than normal video frames to achieve the same user perceived quality [20]. Secondly, processing such video frames at the end devices imposes high resource utilization, which is problematic particularly on devices with limited resources such as smartphones.
Finally, 360° video streaming requires strictly low latency, which should be within the motion-to-photon delay¹ (just above 25 ms); otherwise the user may suffer from severe motion or cyber-sickness [12]. Recently proposed viewport (VP)–the currently visible area, typically encompassing a 100°×100° Field of View (FoV)–adaptive streaming [28], which streams only a selected portion of the 360° video frame, has shown great promise in addressing the above issues. In particular, the tiling mechanism, which divides the entire frame into tiles and sends only the tiles within the user VP, can reduce the amount of data transferred to the client and increase the user perceived video quality compared to streaming the entire video frame [9, 20, 29, 30]. However, due to the fixed number of tiles (typically 24–36 [9, 20]) and their fixed sizes, these mechanisms do not achieve the optimum bandwidth saving. On the one hand, fixed tiling schemes fail to provide a fine boundary to the user FoV, and therefore, there is high pixel redundancy [28]. On the other hand, they are not aware of visually attractive regions when encoding the tiles. For example, the polar regions of the equirectangular projection (ERP) frame are highly distorted and have a low viewing probability. However, fixed tiling encodes these regions at the same bitrate levels and in the same small tile size as the equatorial regions, adding unnecessary quality overhead and losing many compression opportunities compared to having bigger tiles [10, 28]. To this end, a viewport aware adaptive tiling mechanism with variable sized tiles can provide fine granularity for the FoV and also increase compression gain. This enables content providers to reduce the encoding overhead incurred by tiling the videos and to identify, at a fine granularity, the tiles which should be in high quality.
At the network provider side, the transmitted data volume will be further decreased, reducing the bandwidth consumption of 360° video streaming while providing opportunities to re-innovate caching mechanisms for 360° videos. Finally, existing DASH (Dynamic Adaptive Streaming over HTTP) protocols for 360° videos can be enhanced to provide better user QoE at the client side. ¹Delay between a user head movement and the finish of rendering the corresponding VR frame onto the display. Achieving such an adaptive tiling mechanism is challenging. First, tile scheme generating algorithms should not demand high processing power, as the servers are already highly utilized to support the excessive demand of other video streaming services. Second, the algorithms themselves should run in a minimal time period so that the solution can scale. However, recently proposed dynamic tiling solutions [19, 28, 34] compromise both aspects by encoding all possible tile schemes for a given frame, even at more quality levels than given in DASH protocols [19], and by exploiting time consuming Integer Linear Programming (ILP)/exhaustive search methods. In this paper, we propose VASTile (Viewport Adaptive Scalable 360-degree Video Frame Tiling), an adaptive tiling mechanism for 360° video frames supported by the dynamic nature of user VPs in 360° video streaming. In VASTile, we leverage a computational geometric approach, which we call the Minimal Non-overlapping Cover (MNC) algorithm [24, 25], to devise a suitable tile scheme by partitioning rectilinear polygons generated by combining basic tiles from a 10×20 grid overlaid on the 360° video frames. To generate the rectilinear polygons, VASTile includes a semi-automated thresholding mechanism, which divides the 360° video frame into multiple sub-regions based on the visual attraction level, i.e., saliency, of the pixels of the frame.
Moreover, by taking FoV distortions on the ERP frame into account and removing potential overlaps, VASTile further reduces the downloaded data volume, the transmitted pixel redundancy, and the processing time of the end-to-end tile generation process. We leverage 30 360° videos of 1 min duration, each with 30 user VP traces, to build VASTile and validate its performance. Our experimental results show that VASTile can reduce pixel redundancy before compression by up to 31.1% and save up to 35.4% of bandwidth compared to recently proposed fixed tile configurations. Moreover, circumventing the time consuming process of encoding all possible tile combinations for exhaustive/ILP based search algorithms, VASTile is able to generate a suitable tile scheme within 0.98 (±0.11) s processing time on average, with at least 80% of individual user VP coverage in high quality.² 2 RELATED WORK VP-aware 360° video streaming: A plethora of work has been done on VP-aware 360° video streaming optimization [1, 9, 20, 22, 29, 30]. In these mechanisms, a predicted user VP is sent to the content servers and a selected portion of the frame covering the requested VP is transmitted to the client. The most prominent way of selecting the required portion is to first divide the entire frame into a fixed number of tiles and select only the tiles that fall within the user VP [9, 20, 29, 30]. The overarching goal of VP aware streaming is to reduce the amount of data transmitted through the network and to increase the quality of the streamed portion to increase the user perceived QoE. A major drawback of these proposals is that fixed size tiles transmit a high amount of redundant data, as they cannot provide a fine boundary to the user FoV. Although smaller tiles can create a finer boundary, they increase the encoding overhead, resulting in higher bandwidth consumption [10, 28].
²Artefacts of the work are available at https://github.com/manojMadarasingha/VASTile Adaptive tiling schemes on 360° video frames: In contrast to uniform size tiling, variable size tile schemes are proposed in [11, 33]. They divide the frame into a fixed number of tiles but vary their size according to the latitude. Differently, combining sets of basic tiles of a fixed minimum size to form larger tiles of variable number and size is presented in [19, 28, 34]. Both [28] and [34] leverage ILP based solutions to find the best tile configurations, taking the server side storage size and streaming data volume as the cost functions. Ozcinar et al. present an exhaustive search method to derive tiles while allocating a given bitrate budget [19]. These approaches lack scalability for two reasons. First, they require encoding all possible tile schemes, which may exceed 30000 solutions [28], incurring high encoding time. Second, algorithms such as exhaustive search/ILP themselves need a long processing time. In contrast to the above methods, we propose a scalable, adaptive tiling mechanism leveraging a computational geometric approach, which can provide high quality tiles in the user VP while reducing the pixel redundancy before compression and the streamed volume of data. 3 BACKGROUND AND MOTIVATION Figure 1: MNC based partitioning and comparison with fixed tile configurations: (a) partitioning regions with high visual attention using the basic MNC algorithm; (b) % pixel redundancy before compression and no. of tiles per frame. VASTile aims to partition video frames leveraging the visual attention level of each pixel in the frame.
Generating visual attention maps has been well studied in the literature [15, 18, 19, 28]; they are often developed either by analysing content features [15, 18] and/or by analysing past VP traces of the video [18, 19, 28]. The primary challenge addressed in this work is how to find an optimal partitioning of frames given a visual attention map. For this purpose, VASTile leverages a computational geometric approach proposed in [25, 27], a.k.a. the Minimal Non-overlapping Cover (MNC). We now present the basics of the MNC algorithm and its applicability to 360° video frame partitioning, showing initial results on the reduction of % pixel redundancy before compression compared to fixed tiling schemes. 3.0.1 MNC algorithm. The original MNC algorithm creates a rectangular tile configuration covering a given polygon region with the fewest number of variable size tiles. In brief, the algorithm runs on hole-free rectilinear polygons, first taking the concave (and convex) vertices on the polygon boundary to find the maximum set of independent chords. These independent chords divide the polygon into sub-rectilinear polygons, which can be further partitioned by drawing chords (vertically/horizontally) from the concave vertices from which a chord has not yet been drawn. We refer the interested reader to [25, 27] for the detailed implementation of the MNC algorithm.
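To convey the flavour of such a rectangle decomposition, the following is a minimal greedy sketch that covers the 1-cells of a rectilinear binary mask with non-overlapping rectangles. It is a simplified stand-in for illustration only, not the chord-based MNC algorithm of [25, 27]: it yields a valid non-overlapping cover, but not necessarily a minimal one. The function name and `(row, col, height, width)` tuple format are our own assumptions.

```python
def partition_mask(mask):
    """Greedily cover the 1-cells of a binary grid with non-overlapping
    rectangles. Each rectangle is grown rightwards along its first row,
    then downwards while every row segment stays inside the region."""
    H, W = len(mask), len(mask[0])
    covered = [[False] * W for _ in range(H)]
    rects = []  # (row, col, height, width) in basic-tile units
    for r in range(H):
        for c in range(W):
            if mask[r][c] and not covered[r][c]:
                # grow right along row r while cells are inside and free
                w = 1
                while c + w < W and mask[r][c + w] and not covered[r][c + w]:
                    w += 1
                # grow down while the whole width-w segment stays inside
                h = 1
                while r + h < H and all(
                        mask[r + h][c + k] and not covered[r + h][c + k]
                        for k in range(w)):
                    h += 1
                for i in range(r, r + h):
                    for j in range(c, c + w):
                        covered[i][j] = True
                rects.append((r, c, h, w))
    return rects
```

On an L-shaped region the sketch emits one large rectangle plus one cell; the real MNC algorithm additionally minimizes the number of rectangles via maximum independent chords.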
Figure 2: Overall architecture of VASTile and the 4 main regions identified in a given frame: (a) Pre-processing, Partitioning and Post-processing steps, which take individual user VPs as the input and output a suitable tile scheme on which variable bitrate can be allocated; (b) the 4 FoV regions considered in frame partitioning. 3.0.2 Applicability of the MNC algorithm for partitioning. Fig. 1a shows that, on the visual attention maps generated using past VP traces, we can create rectilinear polygons surrounding the regions with high VP probability. These polygons, which are comprised of sets of basic tiles (i.e., the smallest tiles that cannot be further partitioned, denoted as BT), can be partitioned by the MNC algorithm as in Section 3.0.1. However, the basic MNC algorithm is unaware of visual attention. Leveraging simple pre/post-processing steps on the video frame, we can inject viewport awareness into the MNC algorithm, converting the process into quality adaptive tiling, which is further elaborated in Section 4.1. 3.0.3 Comparison with fixed tile configurations. We run the MNC algorithm on individual user VPs³ and compare the generated tile schemes with fixed tile configurations [9, 13, 20], measuring % pixel redundancy before compression. We leverage VP maps of 5 randomly selected users from 5 sample videos representing the categorization provided in [3]⁴. We consider three basic tile (BT) configurations (5×10, 10×15 and 10×20) to generate rectilinear polygons as the input to the MNC algorithm. We compare the derived tile schemes with five fixed tile configurations (4×6, 6×6, 5×10, 10×15 and 10×20) applied to the same user VPs. We measure % pixel redundancy before compression as in Eq. 1:

$$\text{Pixel red. before compression} = \frac{N_T - N_{FoV}}{N_{FoV}}\,\% \qquad (1)$$

where $N_T$ and $N_{FoV}$ represent the number of non-zero pixels in the tiles which overlap with the user FoV and the actual FoV size, respectively. Fig. 1b shows the analysis results. We see that the 10×20 configuration results in the lowest % pixel redundancy for both MNC partitioning and the fixed tile approach. However, to cover the same single user VP, the 10×20 configuration in fixed tiling requires ×25 the number of tiles generated by the MNC algorithm. Therefore, compared to the MNC algorithm, more encoding overhead is incurred by fixed tile configurations, as the tiles are smaller and higher in number [10, 28]. ³Unlike the visual attention maps, which combine multiple user VPs, an individual user VP is a binary map representing the FoV by 1 and outside the FoV by 0. ⁴[3] puts videos into 5 categories: riding, moving focus, exploration, rides and miscellaneous (a combination of previous types). 4 DESIGN OF VASTILE FRAMEWORK 4.1 VASTile overview Fig. 2a shows the VASTile overview. The objective of the Pre-processing step is to identify different regions in a frame according to the expected visual attention, in order to inject viewport awareness into the MNC algorithm. We first create an averaged Viewport Map (VM) combining individual user VP maps and then apply a hierarchical thresholding mechanism to detect visual attention blobs⁵ for 4 regions, as depicted in Fig. 2b. The first two regions are defined on the area which covers at least 80% of the user VPs, namely R(FoV_f), which covers the most attractive regions, and R(FoV), covering the remainder of the extracted area. We define R(Buf) as an additional buffer region to cover VPs which deviate slightly from R(FoV_f) and R(FoV) and are therefore likely to be outside the region covering at least 80% of user VPs. Finally, R(OoV) is the remaining area of the VM with the lowest viewing probability. Details of the Pre-processing steps are presented in Section 4.2.
In the Partitioning step, we run the MNC algorithm on the above 4 regions separately. We create rectilinear polygon boundaries around the blobs in those regions and partition them into non-overlapping tiles, denoted as DTs (derived tiles), each comprised of a set of BTs (basic tiles). Since we consider the blobs in the VM frame separately, there can be overlaps between DTs if the considered blobs are close to each other. Therefore, we remove such overlaps during partitioning. Details of the Partitioning steps are presented in Section 4.3. In the Post-processing step, we further split DTs which are larger than the FoV (100°×100°), considering its distortion when projected onto the Equirectangular (ERP) frame. Finally, bitrate can be allocated to each DT considering tile properties such as average pixel intensity, tile size and tile location. Details of the Post-processing steps are presented in Section 4.4. ⁵Regions with (near)concentric user VPs. Here onwards, we denote a given user, video and the no. of users for the video by i, j and u_j respectively. Pixel coordinates of video frames are denoted as (m, n), where 0 ≤ m < H, 0 ≤ n < W (H and W are the frame height and width respectively). Figure 3: (left) Angle difference vs. the frame gap; (right) peripheral vision of the human eye. 4.2 Pre-processing We now describe the frame pre-processing pipeline, which identifies salient regions to be partitioned by the MNC algorithm. 4.2.1 Frame sampling (Fig. 2a-a). The frame rate of videos can vary between 24–60 fps, while 30 fps is most common.
It is not necessary to partition each and every frame of a video because (i) it is safe to assume that the FoV of the user is fixed at a certain position for a certain period [18], (ii) having a different tile scheme for each frame can reduce the compression gain in encoding, and (iii) based on the above two facts, running a partitioning algorithm on every frame adds unnecessary computational cost. To decide the most suitable frame gap, we analyse the relationship between the temporal and spatial user VP behavior in light of the peripheral vision of the human eye. Fig. 3 shows the angular difference of the yaw and pitch distributions against sampling frame gaps from 0.1–2.0 s in steps of 0.1 s. Human vision can perceive high quality only in the near peripheral region, i.e., within a 30° range [2, 4, 23]. Based on that fact, we make a fair assumption that, from a fixation point, the user can view with almost the same visual quality up to a maximum of 30° without changing the fixation point. According to practical VP traces, a frame gap of up to 0.8 s can tolerate a 30° angle difference in the yaw direction. Including a safety margin of 5°, we decide on 0.5 s as the frame gap to refresh the tile scheme without significant harm to the user perceived quality. Compared to the 1 s tile scheme refreshing period found in the literature [19, 28], which can lead to an angle difference of ≥35°, a 0.5 s gap can better adapt to FoV changes. It also reduces the encoding overhead and processing time compared to considering every single frame for partitioning. 4.2.2 Viewport Map (VM) generation (Fig. 2a-b). Despite many different approaches for generating visual attention maps, we leverage historical user VP traces. [19] claims that 17 users are sufficient to create a representative VP map. Therefore, we consider 20 users from each video to generate the overall visual attention regions. First, given the centre of the VP, c_i, of each user i in <yaw, pitch> angles, we create a binary map V_i according to Eq. 2.
$$V_i(m,n) = \begin{cases} 1, & \text{if } (m,n) \in F_i \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

We assume a 100°×100° FoV area (F_i), representing the FoV of the majority of commercially available HMDs [3]. We also project the spherical coordinates of pixels (x, y) to the ERP format (m, n) considering the geometrical distortion, creating a more dispersed pixel distribution towards the upper and bottom parts of the frame (i.e., corresponding to the polar regions of the spherical frame). We then average V_i over all u_j users. The resulting frame is histogram equalized and normalized by dividing by the maximum pixel value (=255). We denote this processed frame as VM. Figure 4: % user viewport distribution on the thresholded area under different threshold values for both approximate (α) and buffer (β) thresholding: (a) α validation; (b) β validation.

Algorithm 1 Determine α∗
1: Input
2:   VM — normalized VP map
3:   {V_i} — set of individual user VP maps ∀i ∈ [1, u_j]
4: Variables
5:   VM(α) — binary map of VM after thresholding by α
6:   I_i(α) — intersection map between V_i and VM(α)
7:   s_i(α) — % overlap between V_i and I_i(α)
8:   S(α) — set containing s_i(α) ∀i ∈ [1, u_j]
9:   s_avg(α) — avg. of s_i(α) over all u_j users
10: for α = 0.4 to 0.7, step = 0.1 do
11:   VM(α)(m,n) = 1 if VM(m,n) ≥ α, 0 otherwise
12:   for i = 1 to u_j do
13:     I_i(α) = VM(α) ∩ V_i ⊲ get intersection map
14:     s_i(α) = (ΣΣ I_i(α) / ΣΣ V_i)% ∀m ∈ [0, H), ∀n ∈ [0, W)
15:     S(α).add(s_i(α)) ⊲ store the % overlap of user i
16:   s_avg(α) = (1/u_j) Σ_{i=1}^{u_j} s_i(α), s.t. s_i(α) ∈ S(α)
17:   if s_avg(α) < 80% then ⊲ check for 80% coverage
18:     if α > 0.4 then
19:       α∗ = α − 0.1
20:     else ⊲ if no α satisfies the 80% coverage
21:       α∗ = 0.4
22:     return α∗
23: return α∗ = 0.7 ⊲ if α = 0.7 satisfies the 80% coverage

4.2.3 Approximate Thresholding (Fig. 2a-c). The aim of this step is to filter the R(FoV_f) + R(FoV) regions representing at least 80% of the user VPs.
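Algorithm 1 above reduces to a small search loop over the candidate thresholds. The following is a minimal Python sketch of that loop; representing the VM as a dict from pixel coordinates to normalized values and each user VP as a set of pixels is our simplifying assumption, not the paper's implementation:

```python
def select_alpha(vm, user_vps, alphas=(0.4, 0.5, 0.6, 0.7), coverage=80.0):
    """Pick the largest alpha whose thresholded map VM(alpha) still
    overlaps at least `coverage`% of the average user VP (Algorithm 1).
    Falls back to the smallest candidate if none satisfies the bound."""
    best = alphas[0]
    for alpha in alphas:  # ascending sweep, as in lines 10-23
        thresholded = {p for p, v in vm.items() if v >= alpha}  # VM(alpha)
        overlaps = [100.0 * len(vp & thresholded) / len(vp) for vp in user_vps]
        if sum(overlaps) / len(overlaps) >= coverage:  # s_avg(alpha)
            best = alpha          # this alpha still satisfies 80% coverage
        else:
            break                 # coverage lost: keep the previous alpha
    return best
```

With the β candidates substituted for the α list, the same loop serves the buffer thresholding of Section 4.2.6.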
We apply the Approximate threshold (α∗), selected from a discrete set of α values based on the VP distribution on VMs, to extract the above two regions. First, to determine the possible set of α values, we calculate the overlap between individual user VPs and the thresholded regions for a discrete set of α applied on VMs. Fig. 4a shows that, except for 0.8, all other values can cover at least 80% of the FoV region for at least one frame. We also note that the 80% margin covers the VPs of at least 17 users out of 20, which is claimed to be the minimum number of user VPs needed to generate a representative VM [19]. Thus, we restrict the range of α to between 0.4 and 0.7. We present Algorithm 1 to select the highest possible α = α∗ for each frame, because the higher the α, the smoother the region boundaries, reducing the complexity of partitioning. But, to avoid losing important salient regions, we constrain α to threshold at least 80% of VP coverage. First, we create the binary map VM(α) by thresholding the VM with a selected α (line 11). Then we measure the % FoV overlap of each user VP (V_i) with VM(α), calculating an average value s_avg(α) (lines 12–16). Finally, if s_avg(α) no longer gives 80% coverage, we stop further processing and assign α∗ the previous α value. If no α satisfies 80% VP coverage, we select α∗ = 0.4 (lines 17–23). We apply α∗ on the VM and denote the resulting frame containing R(FoV_f) + R(FoV) as VM(α∗).

Algorithm 2 Select blobs from VM(α∗)
1: Input
2:   VM(α∗) — approximate thresholded frame
3: Variables
4:   b_l — l-th blob in VM(α∗), l ∈ [1, l_max], l_max: maximum no. of blobs in VM(α∗)
5:   z_l — size of the l-th blob
6:   B — set containing all the blobs from VM(α∗)
7:   B_sel — set containing the selected blobs
8:   Z_sel — total size of the selected blobs in B_sel
9:   Z_VM(α∗) — total thresholded area of VM(α∗)
10: B ← G(VM(α∗)) ⊲ get all the blobs into set B
11: B_sort ← sort(B) ⊲ descending order of blob size
12: Z_sel = 0
13: while l < l_max do
14:   B_sel.add(b_l) ⊲ add blobs to B_sel selected from B_sort
15:   Z_sel = Z_sel + z_l ⊲ add blob size cumulatively
16:   if (Z_sel / Z_VM(α∗))% ≥ 95 then ⊲ check for 95% coverage
17:     break
18:   l = l + 1 ⊲ increment the blob count
19: VM(α∗, b) ← H(B_sel) ⊲ create pixel map from selected blobs
20: return VM(α∗, b)

4.2.4 Blob detection (Fig. 2a-d). Due to the non-uniform dispersion of the user VPs, the VM can contain multiple blobs. The aim of this step is to identify these blobs and exclude non-significant small blobs, reducing the complexity of the partitioning process. Without loss of generality, we select the blobs in VM(α∗) covering at least 95% of R(FoV) + R(FoV_f). Algorithm 2 summarizes the blob selection process given VM(α∗) as the input. Firstly, the function G(VM(α∗)) outputs all the blobs in VM(α∗) as a set B, which is then sorted in descending order of blob size (lines 10–11). After that, we cumulatively sum up the blob sizes starting from the largest and stop the process when the total selected blob size (Z_sel) exceeds 95% of the total thresholded area (Z_VM(α∗)) of VM(α∗) (lines 12–18). Finally, a map VM(α∗, b) is created combining all the selected blobs using the function H(B_sel) (line 19). 4.2.5 Finer thresholding (Fig. 2a-e). To provide higher quality for the most attractive regions, in this step we filter R(FoV_f) from VM(α∗, b) by defining the Finer threshold, ζ. Without loss of generality, we set ζ = 0.9 to identify the R(FoV_f) region boundaries.
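The blob selection of Algorithm 2 is essentially a greedy, size-ordered accumulation. A compact sketch, with each blob reduced to a hypothetical (id, size) pair rather than a pixel map (our simplification):

```python
def select_blobs(blobs, coverage=95.0):
    """Keep the largest blobs until they account for at least `coverage`%
    of the total thresholded area (Algorithm 2, lines 10-18).
    `blobs` is a list of (blob_id, size_in_pixels) pairs."""
    total = float(sum(size for _, size in blobs))  # Z_VM(alpha*)
    selected, acc = [], 0
    for blob_id, size in sorted(blobs, key=lambda b: b[1], reverse=True):
        selected.append(blob_id)  # B_sel.add(b_l)
        acc += size               # Z_sel accumulated cumulatively
        if 100.0 * acc / total >= coverage:
            break                 # 95% of the thresholded area is covered
    return selected
```

The returned ids correspond to B_sel, from which the pixel map VM(α∗, b) would be rebuilt (the function H in the listing).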
We expand these boundaries to generate perfect rectangular polygons, as discussed in Section 4.3. We denote the finer thresholded frame as VM(α∗, b, ζ), which is the input to the MNC algorithm for R(FoV) + R(FoV_f) partitioning. 4.2.6 Buffer region thresholding and blob detection (Fig. 2a-f and 2a-g). The objective of this step is to filter R(Buf), which covers slight variations of user VPs. We extract R(Buf) from the area of the initial VM not covered by the DTs from VM(α∗, b, ζ), which is denoted as VM_buf (Eq. 3). We use the same DT information to obtain the corresponding buffer regions in V_i (individual user viewports), namely V_i^buf, as in Eq. 4.

$$VM_{buf} \leftarrow VM \cap (VM_T(\alpha^*, b, \zeta))' \qquad (3)$$

$$V_i^{buf} \leftarrow V_i \cap (VM_T(\alpha^*, b, \zeta))' \quad \forall i \in [1, u_j] \qquad (4)$$

where VM_T(α∗, b, ζ) and (VM_T(α∗, b, ζ))′ denote the DT overlay on VM(α∗, b, ζ) by the MNC algorithm and its complement, respectively. In order to extract a suitable buffer region from VM_buf, we apply the Buffer threshold (β∗) and Blob detection in the same way as for α∗ finding and Blob detection in approximate thresholding (see Sections 4.2.3 and 4.2.4). Hence, we compute the % user viewport (V_i^buf) covered by the thresholded region from VM_buf for β ∈ {0.1, 0.2, 0.3, 0.4}. Fig. 4b shows that all threshold values can provide (≥80%)⁶ buffer viewport coverage, and therefore, we dynamically select β = β∗ from the above set. We apply Algorithm 1, simply changing the threshold values (line 10) and replacing VM (lines 2 & 11) and V_i (lines 3 & 13–14) with VM_buf and V_i^buf respectively. After finding β∗, we define the thresholded buffer frame as VM_buf(β∗). After that, to exclude the non-significant smaller blob regions, we apply Algorithm 2 on VM_buf(β∗), simply replacing VM(α∗) with VM_buf(β∗) (lines 2 & 10).
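When the non-zero supports of the maps are represented as pixel sets, the intersections with a complement in Eqs. 3 and 4 become plain set differences. A minimal sketch (the set representation and function name are our simplifying assumptions):

```python
def buffer_candidates(vm_support, user_vps, dt_overlay):
    """Eq. 3: VM_buf keeps the VM pixels not covered by the DT overlay,
    i.e. VM intersected with the overlay's complement.
    Eq. 4: the same subtraction applied to each individual user VP."""
    vm_buf = vm_support - dt_overlay               # VM ∩ (VM_T)'
    v_buf = [vp - dt_overlay for vp in user_vps]   # V_i ∩ (VM_T)'
    return vm_buf, v_buf
```

The threshold search of Algorithm 1 is then re-run on VM_buf with the β candidates to obtain β∗, as described above.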
We denote the blob filtered buffer frame as VM_buf(β∗, b). 4.2.7 OoV extraction. The goal of this step is to extract R(OoV) to derive low quality DTs that satisfy any anomalous user VP. We filter out R(OoV) by removing the area covered by the DT overlay on VM(α∗, b, ζ) + VM_buf(β∗, b) (similar to Eq. 3). No further pre-processing is applied to the R(OoV) region, as no significant pixel value distribution is observed. We denote the OoV region as VM_oov. 4.3 Partitioning The VASTile frame partitioning step runs the MNC algorithm on the R(FoV_f) + R(FoV), R(Buf) and R(OoV) regions separately to generate DTs. We start by creating a rectilinear polygon covering each blob, followed by running the basic MNC algorithm of Section 3.0.1. We select the basic tile configuration as 10×20; nevertheless, VASTile supports the flexible quad tree partitioning structure in H265 with a maximum coding unit (CU) of 64×64 [31], preserving the coding efficiency. Firstly, for R(FoV) and R(FoV_f) partitioning (Fig. 2a-h), we leverage VM(α∗, b, ζ). Fig. 5 shows the (R(FoV) + R(FoV_f)) partitioning process. ⁶Covering the corresponding V_i^buf of at least 17 users out of 20. We expand the detected polygons in R(FoV_f) (i.e., the polygons in blue) into perfect rectangles. The boundary is extended to the minimum and maximum (m, n) locations as long as it does not exceed the R(FoV) polygon boundary, as shown by the red arrows. By this step, we make the partitioning process simpler and create an extra buffer so that the R(FoV_f)s can be encoded at a higher quality. Fig. 5(b) shows that the polygons for R(FoV) (e.g., R3) are extracted by removing all the polygons generated for R(FoV_f) (e.g., R1, R2). Note that extracting R(FoV_f) as rectangles creates holes in the R(FoV) region.
Because the basic MNC algorithm is proposed for hole-free rectilinear polygons, we have added additional steps on top of the VASTile MNC implementation. In brief, when finding the maximum independent chords, we also consider the vertices of the holes. Then, from the remaining vertices which are not connected to any independent chord, we draw extra chords to complete the partitioning. We avoid drawing any chord over the holes. Secondly, taking the VM_buf(β∗, b) and VM_oov frames, the MNC algorithm partitions R(Buf) and R(OoV) respectively (Fig. 2a-i and Fig. 2a-j), without any further processing on the polygons around the detected blobs. Finally, due to the close proximity of selected blobs, certain DTs may overlap with each other. Given two such tiles, we remove the overlapped region only from the smaller DT, ensuring a non-overlapping DT coverage of the entire frame. Figure 5: R(FoV) + R(FoV_f) partitioning: (a) before and (b) after expanding the rectilinear polygons of R(FoV_f). Figure 6: Further partitioning mechanism and maximum allowable tile size in the horizontal (γ·HT_max) and vertical (γ·VT_max) directions, in basic tiles, based on the vertical position of the center of a given DT; further partitioning starts from the center lines. 4.4 Post-processing 4.4.1 Further partitioning of bigger DTs. We further partition DTs beyond a certain size limit in order to reduce pixel redundancy. Because we consider the polygon boundary for partitioning, we may encounter DTs even bigger than the FoV size horizontally, vertically, or both.
Hence, any slight overlap with such a tile incurs large pixel redundancy; therefore, we define a maximum allowable DT size considering the variation of FoV distortion according to its vertical position. For example, VPs located towards the polar regions can have a larger DT size, as the corresponding FoV on the ERP frame spreads over a larger region compared to the equator. Thus, considering the no. of BTs overlapped with the distorted FoV maps on the ERP frame, we define the maximum allowable DT size in both the vertical (VT_max) and horizontal (HT_max) directions, as shown on the y-axis of Fig. 6.

Further reducing VT_max and HT_max helps decrease redundant data transmission, as the DT size becomes smaller; nonetheless, it incurs high encoding overhead. To see the impact, we take γ·VT_max and γ·HT_max, where γ ∈ [0, 1]. We set γ ∈ {0.25, 0.5, 1.0} in our experiments. Decreasing γ results in smaller tiles. After detecting larger DTs, we start partitioning outwards from the center lines, as in the example tiles in Fig. 6. This is because the majority of user VPs concentrate around the center of the frame [5, 13]. Therefore, to reduce potential quality changes within a tile, we keep the DTs near the center lines unsplit as much as possible.

Finally, a quality-allocated tile scheme can be derived as in Fig. 2a-l, considering multiple properties of the DTs, such as pixel intensity and the size and location of the tile. Implementing a proper bitrate allocation scheme is left as future work; such interactions with bitrate allocation are discussed in Section 7.

5 EVALUATION SETUP
Dataset: We develop and validate the algorithms in VASTile leveraging VP traces collected for 30 videos from three different datasets [13, 16, 26]. All videos are 60 s long at 30 fps. The VP center is denoted by a <yaw, pitch> angle.
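A <yaw, pitch> VP center maps onto ERP frame coordinates by a standard equirectangular projection. The sketch below assumes yaw ∈ [-180°, 180°] increasing rightwards and pitch ∈ [-90°, 90°] increasing upwards (conventions vary across VP-trace datasets, so this is an assumption, not a detail stated in the paper).

```python
def vp_center_to_erp(yaw, pitch, width, height):
    """Map a VP center <yaw, pitch> in degrees to ERP pixel coordinates.
    yaw = 0, pitch = 0 (looking at the equator) maps to the frame center;
    the top-left pixel corresponds to yaw = -180, pitch = 90."""
    x = (yaw + 180.0) / 360.0 * width
    y = (90.0 - pitch) / 180.0 * height
    return x, y
```

For a 4K ERP frame, vp_center_to_erp(0, 0, 3840, 2160) returns the frame center (1920.0, 1080.0).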
Each video has VP traces from 30 users; we take the traces of 20 randomly selected users to develop tile schemes using VASTile and use the remaining 10 users' VPs to validate its performance. The selected videos represent the 360° video categorization proposed in [3] and span different genres such as sports, documentary, and stage performance.
Hardware and software setup: We implement the VASTile architecture in Python (about 4500 lines of code) on macOS with an Intel Core i9 2.3 GHz CPU (single core). We use the Networkx 2.4 package [8, 17] to implement the MNC algorithm.7 Videos in HD (1920 × 1080) and 4K (3840 × 2160) resolutions are encoded using FFmpeg 4.1 in H265 (HEVC) provided by libx265 at the default Quantization Parameter (QP) of 28, with motion-constrained tiling.

7 For bipartite graph generation when searching for maximum independent chords in a given polygon (cf. Section 3.0.1).

Evaluation metrics and comparison benchmarks: We compare VASTile with three fixed tile configurations: 4 × 6 [20], 6 × 6 [9], and 10 × 20 [13]. To evaluate VASTile under viewport-aware streaming, we use two metrics. i) % pixel redundancy before compression: the extra pixels in selected DTs that do not overlap with the user FoV, computed using Eq. 1. The higher the pixel redundancy, the more pixel-level operations are required in video encoding/decoding. ii) Downlink (DL) data volume: the data transmitted by the tiles selected for individual user VPs, which determines the bandwidth saving. The total no. of DTs covering the entire frame at γ = 1 and γ = 0.5 is close to that of the fixed 4 × 6 and 6 × 6 tile configurations respectively. We therefore compare VASTile's relative gain over fixed tiling using the above metrics with the corresponding configurations: C1: γ = 1 vs. 4 × 6, and C2: γ = 0.5 vs. 6 × 6.

6 RESULTS
6.1 Distribution of DTs
We analyse the DT distribution in each region: R(FoV_f), R(FoV), R(Buf) and R(OoV). First, we vary γ, which controls the maximum allowable DT size (see Section 4.4), to observe the corresponding variations in the no. of DTs and their avg. size in BTs in each region. Table 1 reports the results averaged over all frames of the 30 videos. For each γ value, around 31% of the DTs on the entire frame cover R(FoV); however, their avg. tile size is 37.3% lower than that of the DTs in R(FoV_f). This is due to the region expansion of R(FoV_f) during the Partitioning step (see Section 4.3): the remaining regions of VM(α*, b, ζ) covered by R(FoV) are smaller patches surrounding R(FoV_f). The second largest tiles are derived in the R(OoV) area, as the maximum allowable tile size is higher near the top and bottom regions of the ERP frame.

Table 1: DT distribution in 4 regions: no. of DTs (#T) and avg. tile size in BTs (S), varying γ.

  γ    | R(FoV_f)  | R(FoV)    | R(Buf)    | R(OoV)    | Total
       | #T    S   | #T    S   | #T    S   | #T    S   |  #T
  0.25 | 18   3.2  | 19   2.8  | 11   3.0  | 16   3.6  |  64
  0.50 |  8   6.8  | 13   4.0  |  8   4.0  | 11   5.4  |  40
  1.00 |  4  13.4  |  9   5.6  |  6   5.6  |  9   6.4  |  28

Fig. 7 shows the temporal variation of the DT distribution over the entire video duration for γ = 0.5. Figs. 7a and 7b show that the DT distribution becomes stable within the first 5 s. For example, in Fig. 7a, the no. of DTs in R(OoV) decreases from 15 to a stable value of 10. In contrast, the no. of DTs in R(FoV_f) and R(FoV) increases from 5 to 10 and from 10 to 14 respectively. Fig. 7b shows that the DT size in R(OoV) decreases from 7 to 5 (in BTs), whereas the DTs in R(FoV_f) and R(FoV) keep nearly constant sizes of 6 and 4. Thus, within the first 5 s, VASTile generates larger R(FoV_f) DTs and a high no. of R(OoV) DTs, as the user VPs are concentrated in a certain area. When the user VPs start spreading over the frame, VASTile generates more R(FoV) and R(FoV_f) DTs, reducing the no. of R(OoV) DTs. Fig. 7c shows the % user VP overlap with the DTs in the 4 regions.
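As a sanity check, the 37.3% average size gap between R(FoV) and R(FoV_f) DTs quoted above can be reproduced directly from the S columns of Table 1:

```python
# Avg. tile sizes S (in BTs) from Table 1 for R(FoV_f) and R(FoV) per γ.
s_fov_f = {0.25: 3.2, 0.50: 6.8, 1.00: 13.4}
s_fov   = {0.25: 2.8, 0.50: 4.0, 1.00: 5.6}

# Relative size reduction of R(FoV) DTs versus R(FoV_f) DTs, per γ,
# then averaged over the three γ settings.
reductions = [1 - s_fov[g] / s_fov_f[g] for g in sorted(s_fov)]
mean_reduction = sum(reductions) / len(reductions)
print(f"{mean_reduction:.1%}")  # → 37.3%
```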
More than 50% and 30% of each individual user VP overlaps with DTs from R(FoV_f) and R(FoV) respectively, enabling the allocation of high quality to the DTs within the user VPs. Fig. 7d shows the proportion of each DT area overlapped with the user VP. Starting from 80%, the avg. overlapped proportion of the DTs in R(FoV_f) decreases to 70%: at the beginning, R(FoV_f) DTs provide a finer boundary to the individual user VPs, but this degrades slightly as the VPs disperse. Only 48% of the area of R(FoV) DTs overlaps with user VPs, since many R(FoV) DTs cover the boundary of the high visual attention areas. Fig. 8a shows the Total Pixel Intensity per Basic Tile (PI/BT = (sum of pixel values in the DT) / (no. of BTs in the DT)) for the DTs in the 4 regions. Overall, R(FoV_f) and R