IPTV has been widely deployed throughout the world, bringing significant advantages to users in terms of channel offering, video on demand, and interactive applications. One aspect that has often been neglected is the capability for precise and unobtrusive telemetry. The TV set-top boxes deployed in modern IPTV systems can be thought of as capable sensor nodes that collect vast amounts of data, representing both user activity and the quality of service delivered by the system itself. In this paper we focus on user-generated events and analyze how the stream of channel change events received from the entire IPTV network can be mined to obtain insight about the content. We demonstrate that it is possible to predict the occurrence of TV ads with high probability and show that the approach could be extended to model user behavior and classify the viewership along multiple dimensions.
Internet Protocol Television (IPTV) has become an essential part of modern triple-play offerings and is being widely deployed worldwide. In addition to its advantages in terms of enhanced and interactive content, and the possibility of including long-tail content in the provider’s offering, such systems also support effortless collection of data at multiple levels in the system [
In this paper we focus on the analysis of the channel change data stream [
Most previous works on ad segment detection [
Such crowdsourced ad detection can have many practical applications, both from the advertisers’ and regulators’ points of view, as well as the viewers’. For instance, broadcasters and ad agencies could adjust the content timing and form to increase engagement. Regulators could use such a system as an additional data source to monitor whether TV stations obey regulations limiting excessive broadcasting of commercials [
In this section we describe the data and the methods used for ad detection. Our research was conducted on pseudoanonymized channel change event data obtained from an ISP, a dataset of TV ad segment timestamps and durations, and additional synchronization information obtained by manually aligning both datasets to the live video feed.
Two main datasets were used in our research. Firstly, a near-real-time pseudoanonymized event stream of all channel change events in the IPTV network was obtained from the ISP. The source of the data stream was diagnostic SNMP trap messages, sent both periodically and on every channel change event. This provided a continuous, time-varying data stream, ranging from 600 to 5000 events per minute. The data stream contained all channel changes of a set of 10,000 users, randomly sampled by the ISP. Two months’ worth of data (from June 1, 2015, to August 1, 2015) was captured and stored in a key-value database (Apache Cassandra). The database schema was optimized for efficient retrieval of data by time range and used the event timestamp as the key.
Next, a set of TV ad segment start and end times was obtained from the national TV broadcaster for a limited subset of the channels. We limited our analysis to the largest national TV channel, which captures a significant portion (16%) of the national viewership.
Since the two described datasets originate from different systems (and two different providers), there was no guarantee that their timing was in sync; thus, to be able to train our algorithm, the first step in data preparation was the synchronization of the two datasets.
By plotting both datasets against time, a noticeable lag was evident (Figure
Channel changes in time with labeled ad segments. The interval in the chart represents 30 minutes (with each bar representing 10 seconds) on June 20, 2015, from 23:22 to 23:52. The height of the bars of either color represents the number of channel changes by all of the observed viewers on the channel; red color highlights the periods with labeled ads, as provided by the broadcasting company. A significant time lag is noted due to misaligned timestamps, which have to be adjusted before the training phase begins.
As we only had historical TV ad segment data, we could not check the alignment with TV footage manually. To calculate the offset of the ad segment data we identified the ad segments that repeat at known hours and calculated the offset to the IPTV live broadcast.
To determine the offset of the pseudoanonymized channel change data stream, we had to engineer an unlikely pattern of channel changes: starting early in the morning, when viewership was low, we made a channel change event on a specific channel every minute on the minute mark, for 20 minutes. We then identified this pattern in the captured dataset and were able to calculate the offset in the event timestamps.
Lastly, we subtracted the two offsets and found a 70-second lag between the datasets, which we used to adjust the smaller dataset (the ad segment timestamps). An inaccuracy on the scale of a couple of seconds is possible, but we assessed that it does not affect the algorithm, which operates on a larger time scale.
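The alignment described above can be sketched as a bounded cross-correlation search. The function below is a minimal illustration, not the exact procedure used in the study: it recovers a known lag between two per-second count series by sliding one against the other and picking the shift with the highest correlation.

```python
import numpy as np

def estimate_lag(series_a, series_b, max_lag=300):
    """Estimate the lag (in samples) that best aligns series_b to series_a
    by maximizing the cross-correlation over a bounded search window."""
    # z-normalize so the correlation score is scale-independent
    a = (series_a - series_a.mean()) / (series_a.std() + 1e-9)
    b = (series_b - series_b.mean()) / (series_b.std() + 1e-9)
    n = min(len(a), len(b))
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            score = np.dot(a[lag:n], b[:n - lag])
        else:
            score = np.dot(a[:n + lag], b[-lag:n])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Synthetic check: b is a copy of a shifted by 70 samples (seconds).
rng = np.random.default_rng(0)
a = rng.poisson(5, 1000).astype(float)
b = np.roll(a, -70)
print(estimate_lag(a, b))  # → 70
```

In practice the engineered channel change pattern plays the role of the sharp, unambiguous correlation peak that this toy Poisson series provides.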
The first step in data cleaning was removing all periodically reported events that did not carry channel change information (the viewer remained on the same channel). Next, we discarded the events of all channels except the largest one by viewership, which matched the ad segment data. The initial dataset size averaged 2 GB per day (for 60 days, from June 1, 2015, to August 1, 2015), totaling 120 GB. After filtering, the remaining dataset size was approximately 200 MB per day on average, totaling 12 GB of raw data throughout the studied timeframe.
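The filtering step can be sketched as follows; the event record layout (a `ChannelEvent` with a `prev_channel` field) is a hypothetical illustration, since the actual SNMP trap message format is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class ChannelEvent:
    timestamp: int      # unix seconds
    user_id: str        # pseudoanonymized viewer id
    channel: str        # channel reported in this event
    prev_channel: str   # channel reported in the previous event

def filter_events(events, target_channel):
    """Keep only genuine channel changes onto the studied channel,
    dropping periodic reports where the viewer stayed on the same channel."""
    return [
        e for e in events
        if e.channel != e.prev_channel and e.channel == target_channel
    ]

events = [
    ChannelEvent(0, "u1", "TV1", "TV1"),   # periodic report, no change
    ChannelEvent(5, "u1", "TV2", "TV1"),   # change, but not the studied channel
    ChannelEvent(9, "u2", "TV1", "TV3"),   # change onto the studied channel
]
print(len(filter_events(events, "TV1")))  # → 1
```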
Data preparation proceeded as follows: by aggregating the data we obtained time series representing the number of channel changes at each second of each day, with 1-second granularity. The resulting time series served as the first input to our algorithm. The time series is defined as a sequence of pairs
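A minimal sketch of this aggregation, assuming the channel change events have already been reduced to unix-second timestamps:

```python
from collections import Counter

def build_time_series(timestamps, start, end):
    """Aggregate channel-change timestamps (unix seconds) into a
    per-second count series covering [start, end)."""
    counts = Counter(t for t in timestamps if start <= t < end)
    return [counts.get(s, 0) for s in range(start, end)]

# Two changes at second 100, one at 101, one at 103.
ts = build_time_series([100, 100, 101, 103], start=100, end=105)
print(ts)  # → [2, 1, 0, 1, 0]
```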
Both the channel change dataset and the time-adjusted ad segment dataset were split into 3 equal parts: the labeled dataset, the training dataset, and the validation dataset, each spanning 20 days. For each TV ad segment in the labeled dataset we used the start and end timestamps to cut sub-time-series out of the overall channel time series. We thus obtained several hundred time series representing the number of channel changes per second throughout an ad segment, relative to the start of the commercial. An example of a TV ad segment is represented in Figure
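Cutting the labeled sub-time-series out of the full channel series can be sketched as below; the helper `ad_subseries` and its toy inputs are illustrative, not the study's code.

```python
def ad_subseries(series, series_start, segments):
    """Cut one sub-series out of the full per-second count series for each
    labeled ad segment, so each result is relative to the segment start.
    series_start is the unix second of series[0]; segments is a list of
    (ad_start, ad_end) unix-second pairs."""
    out = []
    for ad_start, ad_end in segments:
        i, j = ad_start - series_start, ad_end - series_start
        if 0 <= i < j <= len(series):  # skip segments outside the series
            out.append(series[i:j])
    return out

# Toy series: the count at second t is simply t, starting at t = 0.
series = list(range(20))
print(ad_subseries(series, 0, [(3, 7), (15, 18)]))
# → [[3, 4, 5, 6], [15, 16, 17]]
```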
Example of a TV ad segment (red). The interval in the chart represents 5 minutes (with each bar representing 10 seconds) on June 30, 2015, from 19:52 to 19:56.
The algorithm depends on time series comparison, so we define a similarity function over two time series. Given two time series
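Since the paper's exact similarity function is given in the equations (not reproduced here), the sketch below uses plain Euclidean distance between equal-length count series as one plausible stand-in.

```python
import math

def distance(x, y):
    """One plausible similarity measure: Euclidean distance between two
    equal-length per-second count series. The paper's actual definition
    may differ; this is an illustrative assumption."""
    assert len(x) == len(y), "series must cover the same duration"
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(distance([1, 2, 3], [1, 2, 3]))  # → 0.0  (identical series)
print(distance([0, 0], [3, 4]))        # → 5.0  (3-4-5 triangle)
```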
The last phase builds a logistic regression model to predict whether a TV ad was played in a given interval. Logistic regression is a mathematical modeling approach that can be used to describe the relationship of several independent variables to a dichotomous dependent variable, such as TV ad occurrence [
To build the logistic regression model we use IPTV time series data from the training dataset. We use one-parameter logistic regression with a custom dependent variable, calculated as described below.
We loop through the IPTV time series in the training dataset and, for each 10-second interval, calculate the distance from the IPTV time series to all of the labeled commercial time series generated in the data preparation phase. For each 10-second interval we take the minimum distance. For each interval we also check whether an ad was actually played, based on the shifted broadcaster ad segment timestamps. This yields pairs of numbers
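The feature construction and model fitting can be sketched as follows. To stay self-contained, the sketch fits a one-feature logistic regression by plain gradient descent on toy values; the distances, labels, and learning-rate settings are illustrative assumptions, not the study's data or exact training procedure.

```python
import numpy as np

def fit_logreg_1d(x, y, lr=0.5, steps=2000):
    """One-parameter logistic regression (plus intercept) fitted by
    plain gradient descent on the log-loss."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))  # predicted ad probability
        g = p - y                               # log-loss gradient factor
        w -= lr * np.mean(g * x)
        b -= lr * np.mean(g)
    return w, b

# Feature: minimum distance from each 10-second interval to any labeled
# ad template; label: whether an ad actually played (toy values).
min_dist = [0.0, 1.4, 8.7, 8.8]
played   = [1, 1, 0, 0]
w, b = fit_logreg_1d(min_dist, played)
prob = lambda d: 1.0 / (1.0 + np.exp(-(w * d + b)))
print(prob(0.5) > 0.5, prob(9.0) > 0.5)  # → True False
```

The learned coefficient is negative: the smaller the minimum distance to a known ad pattern, the higher the predicted probability that the interval contains an ad.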
The model was tested on the validation dataset, where we used it to predict ad segments and validated the predictions against the broadcaster’s ad segment data.
The Receiver Operating Characteristic (ROC) curve [
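The ROC evaluation can be sketched as a threshold sweep over the model's scores; the scores and labels below are toy illustrations, not results from the study.

```python
def roc_points(scores, labels):
    """Compute (FPR, TPR) points by sweeping the decision threshold
    over every observed score, from strictest to loosest."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

# Toy scores: a higher score means the model thinks an ad is more likely.
print(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
# → [(0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```

A perfectly separating model, as in this toy case, passes through the (0.0, 1.0) corner; the area under the curve summarizes performance across all thresholds.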
ROC curve of the proposed model.
Results depend on users’ viewing habits, which change throughout the day and are also strongly dependent on the type of TV content. For example, news broadcasts on the channels we studied tend to have an ad segment at exactly the same time every day; regular viewers learn to anticipate when the ad segment will start and change the channel preemptively. Similarly, some TV shows have foreshadowing events before the ad segment, which trigger viewers to change the channel even before the ad segment starts. Examples of our prediction model results during a movie (Figure
Graph with TV ad segment prediction on commercial block during a movie. The interval in the chart represents 5 minutes (with each bar representing 10 seconds) on July 22, 2015, from 23:27 to 23:32. The beginning of the segment is determined correctly (with precision of <10 seconds), while the end of the segment is overestimated.
Graph with ad prediction on an ad segment during a news show, where our prediction model predicts the segment too soon. The interval in the chart represents 5 minutes (with each bar representing 10 seconds) on July 22, 2015, from 19:33 to 19:39. In the case of a regular commercial break, viewers’ anticipation and channel switching before the segment starts trigger a false detection.
This error occurs because we only take into account the number of channel changes, and an ad segment usually starts with a sudden increase in channel changes, as seen in Figures
In this paper we have presented a model that detects TV ads based on synchronous channel change events. We trained the model with logistic regression, using the labeled ads provided by the broadcaster. After training, the model performed with high reliability (88%). This also confirms the hypothesis that a high percentage of the unwanted content is in fact advertising.
The algorithm depends on a time series similarity measure that is defined in a manner that fits the ad properties reasonably well. Since the number of ad-sequence time series representatives is constant and these time series are short, the computational complexity of the algorithm is
In future work we will continue to focus on crowdsourced data mining of the IPTV network. Further work is also needed on detailed classification of the undesired segments, for example, dividing such content into ads and other types of content; this would provide an automated TV ratings system with relatively high temporal resolution and would have many practical applications. A second line of future work will focus on user modeling based on viewing habits. By clustering viewers by their content voting preferences, a finer viewer model could be developed to offer personalization and content discovery options that aid the viewer.
The authors declare that they have no competing interests.
The authors would like to thank companies Telekom Slovenije and RTV Slovenija for providing the datasets and all collaboration that made this research possible. The work was supported by the Ministry of Education, Science, and Sport of Slovenia and the Slovenian Research Agency.