# Dynamic Content Caching

With the mushrooming of video sharing sites on the web and the ease of sharing videos provided by social networking sites, video streaming now forms a large chunk of the traffic generated on the Internet. Traditional web caches do not cache dynamic content and hence do not cache videos by default. In this project we analyze the gain achievable by caching the videos viewed on a university campus network (Indian Institute of Technology Delhi). We also propose an architecture for such a cache, keeping in mind various technological and user-experience constraints.

## Motivation

With the increase in User Generated Content on the Internet and its subsequent sharing on Online Social Networks (OSNs), video sharing has become a common phenomenon among Internet users. Because there is spatial locality among friends on an OSN, a single video is often watched by many friends in the same area [1]. The Squid proxies generally used on such networks do not cache dynamic content, and hence videos, by default. But with the increasing share of bandwidth consumed by dynamic content, specifically videos, caching them locally makes sense in this scenario. This would reduce load on the centralized server while decreasing start-up times for video viewing, improving the Quality of Experience (QoE). Moreover, the institution's Internet bandwidth usage would also decrease. The most important reasons for not caching videos have been cost (memory) and the small share of video content in Internet traffic statistics. Both factors have undergone a drastic shift, with memory becoming cheaper and video traffic over the Internet increasing.

The following figure explains how requests to Youtube are routed in a typical campus environment:

How requests to Youtube are routed in a campus like environment

A request from a client (1) for a Youtube video is first handled by a proxy server. The proxy redirects the request (2) to the Youtube server, which in return sends back the address of the CDN server where the video is located. The client then requests the video from this CDN cache (3), which usually streams it as Flash media content. We can see an immediate advantage of caching at the proxy, since repeated requests for the same videos can be satisfied quickly, saving the institution's external bandwidth.

## IIT Delhi Logs

To support the above assertion, we wanted to analyze the proxy logs of IIT Delhi to show the effectiveness of caching. We obtained a 2-day log (Jan 24-25, 2011) from the Computer Services Center (CSC) of IIT Delhi, which had the following format:

• Time: when the request was made.
• Duration: for how long the server was busy with this request.
• Result code: Squid result code; see [2] for details.
• Bytes: amount of data transferred as a result of this transaction.
• Request method: see [2] for details.
• URL: the URL of the video requested.
• Username: who requested the video.
• Hierarchy code: see [2] for details.
• Type: see [2] for details.

Logs were collected from one of the web proxies in IIT Delhi, giving details of cache accesses by B.Tech students of the institute. Unfortunately, the Squid servers are configured so that they do not save the complete URL of dynamic pages and strip the query after "?". We requested the Head of CSC to change the Squid configuration for our work, but doing so would have required re-provisioning all the proxy server resources, and hence our request could not materialize.

In [1] the authors recorded a Youtube access trace from a campus network. Assuming that a similar usage pattern would be observed in the IIT Delhi campus network as well, we requested the authors to provide us the traces they had generated for their work. After receiving the traces, we ran a number of simulations on the datasets and found that caching indeed helps in effective utilization of the resources. In the next subsection we discuss in detail the structure of the trace files and the analysis we have done on them.

### Measurement Philosophy

For the complete methodology of how the traces were collected, please refer to [1]. Here we briefly document the logs, explaining what information was available to us. The data covers a measurement period between June 2007 and March 2008. There are two different sets of traces.

youtube.parsed.*.dat: These trace files contain information about client requests for YouTube video clips. Each line in a file represents such a request and has the following format:

    Timestamp          YouTube server IP  Client IP    Request   Video ID     Content server IP
    1189828805.208862  63.22.65.73        140.8.48.66  GETVIDEO  lML9dik8QNw  158.102.125.12

It is important to note that, for the sake of privacy, the IPs have been anonymized.

flows.*.dat: These trace files contain information about the transport session for each video clip requested by clients from the UMass campus network. Each line contains information about such a session, the important columns of which are:

    Id  Source IP      Destination IP  Start_time  Finish_time  Size in bytes
    5   64.15.112.107  148.85.44.11    0.464492    74.901594    499.180
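As a quick illustration, the request lines of the youtube.parsed.*.dat files can be parsed with a few lines of Python (the field names below are our own, chosen to mirror the column descriptions above; they are not an official schema):

```python
from dataclasses import dataclass

@dataclass
class VideoRequest:
    timestamp: float
    youtube_server_ip: str
    client_ip: str
    request: str
    video_id: str
    content_server_ip: str

def parse_request_line(line):
    """Parse one line of a youtube.parsed.*.dat trace file."""
    ts, yt_ip, client_ip, req, vid, content_ip = line.split()
    return VideoRequest(float(ts), yt_ip, client_ip, req, vid, content_ip)

# The sample line from the trace description above:
line = "1189828805.208862 63.22.65.73 140.8.48.66 GETVIDEO lML9dik8QNw 158.102.125.12"
r = parse_request_line(line)
print(r.video_id)  # lML9dik8QNw
```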

### Video Stats

We analyzed three different traces collected on different dates, details of which are given below:

| Trace | Date       | Number of Requests |
|-------|------------|--------------------|
| 1     | 06-18-2007 | 38,998             |
| 2     | 09-15-2007 | 145,140            |
| 3     | 01-29-2008 | 611,968            |

Each of the three traces has a different number of requests, which helped us analyse whether there are any temporal constraints associated with the measurements. From the traces we found that the average video size was 6.671 MB; the distribution of video sizes is shown in the following figure:

We observe that most of the videos are comparatively small, so even a relatively small cache can store a lot of videos. The distribution is also similar across the traces, meaning that there are no temporal constraints associated with the data set. We also analyzed the number of hits each video got in our data set. This is crucial, since if a video gets many hits, it makes sense to cache it.

Most of the videos were viewed only once, but there were some popular videos that were watched repeatedly. To further check whether the popular videos receive their hits within a localized time interval, we took the most popular video in Trace 1 and plotted the timestamps at which it was requested.

Distribution of Timestamps for a popular video

We observe that most of the requests to a popular video are localised in time, hence making the idea of caching all the more attractive.

### Least Recently Used (LRU) Simulator

We wrote a Python simulator to simulate the Least Recently Used (LRU) cache replacement policy. We assumed that all external requests from the university campus go via a proxy system, and that there is a cache which caches dynamic content in the form of videos. After running the traces through our simulator, we obtained the following graph:

LRU Cache Simulation

There are various observations we can derive from the above graph:

1. Caching in this scenario is highly effective. The hit ratio reached 0.5 in the infinite-cache case, and it is high for the other cache sizes as well, suggesting that this is an effective way to use the given resources.
2. The effectiveness of caching increases with the size of the cache, which is to be expected.
3. The time duration does not have an impact on caching effectiveness. Each trace has a different number of video requests (because of the different durations over which the traces were collected), yet they show almost identical behavior.
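The core of the simulator can be sketched as follows (a minimal illustrative version; the actual simulator additionally replays the full traces over a range of cache sizes):

```python
from collections import OrderedDict

def simulate_lru(requests, cache_size):
    """Replay (video_id, size_bytes) requests against an LRU cache
    of cache_size bytes and return the hit ratio.
    cache_size = float('inf') models the infinite-cache case."""
    cache = OrderedDict()   # video_id -> size, ordered by recency
    used = 0
    hits = total = 0
    for vid, size in requests:
        total += 1
        if vid in cache:
            hits += 1
            cache.move_to_end(vid)   # mark as most recently used
            continue
        # Evict least recently used videos until the new one fits.
        while used + size > cache_size and cache:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[vid] = size
        used += size
    return hits / total if total else 0.0

# Toy trace: video "a" is popular and requested repeatedly.
trace = [("a", 5), ("b", 5), ("a", 5), ("c", 5), ("a", 5)]
print(simulate_lru(trace, cache_size=10))  # 0.4
print(simulate_lru(trace, cache_size=5))   # 0.0 (cache too small, "a" keeps getting evicted)
```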

## Which videos to cache?

In this section, we try to answer this question by running some simulations on the datasets we have.

Relative Standard Deviation: Relative standard deviation is defined as

$RSD=100\times \frac{std}{mean}$

Compared to the traditional standard deviation, the RSD allows standard deviations of different measurements to be compared more meaningfully, as they are normalised by the mean of the data. For example, compare two measurements A and B, where A gives $2 \pm 1$ and B gives $10 \pm 3$. If we consider only standard deviations, A appears more precise, but it is actually B that is more precise, since it has a smaller percentage of variation. We have used the RSD as a criterion in all our simulations.

We analysed the top 20 videos in our dataset. We noted the times at which requests to these videos were made and calculated the relative standard deviation of these times. The initial results showed an RSD of only 0.007%, with all the videos showing similar RSDs. This result was surprising, since it meant that almost all the videos were requested at nearly the same time, with negligible scatter around the mean. Looking further into the code, we found that the times recorded in the datasets were Unix timestamps, which measure time as the number of (micro)seconds elapsed since 1 January 1970 (UTC). This added a large constant to the mean value and decreased the RSD considerably. We therefore took the minimum time value in our dataset and normalised all the recorded times by subtracting this minimum from them. The RSD of these new values was 37%, and the top 20 videos showed significant variation in their RSDs.

A graph of RSDs of top 20 videos showing no discernible pattern
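The RSD computation and the normalisation step can be sketched in Python (the timestamps below are made-up illustrative values, not taken from the traces):

```python
import statistics

def rsd(values):
    """Relative standard deviation: 100 * std / mean."""
    return 100 * statistics.pstdev(values) / statistics.mean(values)

# Raw Unix timestamps: the huge epoch offset inflates the mean,
# driving the RSD toward zero even when the scatter is large.
raw = [1189828805.2, 1189830000.0, 1189900000.0]
print(round(rsd(raw), 4))   # tiny, well under 0.01%

# Subtracting the minimum timestamp removes the constant offset
# and exposes the real relative scatter of the request times.
base = min(raw)
normalised = [t - base for t in raw]
print(round(rsd(normalised), 1))   # large, over 100%
```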

Based on the variation in the RSDs, the videos could be categorized into two sets:

1. High-RSD videos (HRSD): These are videos that show considerable variation in their request times. Since these are among the top videos, an explanation is that they are globally popular and will be requested at fairly regular intervals over a given period of time.
2. Low-RSD videos (LRSD): These are videos whose requests are clustered around their mean value, with all requests contained in a small interval. These videos have local popularity. When such a video is requested for the first time, we can expect many more requests for it in the near future; once sufficient time has passed since the last request, the video can be safely evicted from the cache.

## A new cache replacement algorithm

At any point in time, a video in the cache is categorised into one of the two groups, either HRSD or LRSD. As we have shown, there is a good chance that an LRSD video will not be requested in the future if it has not been requested for a sufficient time. Let us call this time δ (a good value of δ depends on the local scenario and can be tuned over time). So each cached video in the LRSD group maintains a counter recording when it was last requested; if the time since then exceeds δ, the video makes itself available for eviction. The new algorithm can thus be framed as follows:

    Input:
        A request for a video V
    Algorithm:
        if V exists in Cache:
            satisfy the request; nothing else to be done
        else:
            check the LRSD group for a video that has exceeded its δ
            if there exists such a video:
                remove this video and add V to the cache
            else:
                do an LRU eviction from the HRSD group
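A minimal Python sketch of this policy follows. The class and method names are our own, capacity is simplified to a video count rather than bytes, and the group assignment is taken as an input instead of being computed from a heuristic:

```python
import time

class VideoCache:
    """Sketch of the HRSD/LRSD replacement policy described above."""

    def __init__(self, capacity, delta):
        self.capacity = capacity   # max number of cached videos (simplified)
        self.delta = delta         # staleness threshold for LRSD videos
        self.last_access = {}      # video_id -> last request time
        self.group = {}            # video_id -> "LRSD" or "HRSD"

    def request(self, video_id, is_lrsd, now=None):
        """Handle a request; return True on a cache hit."""
        now = time.time() if now is None else now
        if video_id in self.last_access:           # hit: refresh recency
            self.last_access[video_id] = now
            return True
        if len(self.last_access) >= self.capacity:
            self._evict(now)
        self.last_access[video_id] = now
        self.group[video_id] = "LRSD" if is_lrsd else "HRSD"
        return False

    def _evict(self, now):
        # Prefer an LRSD video that has been idle longer than delta...
        stale = [v for v, t in self.last_access.items()
                 if self.group[v] == "LRSD" and now - t > self.delta]
        if stale:
            victim = min(stale, key=self.last_access.get)
        else:
            # ...otherwise fall back to LRU within the HRSD group
            # (or the whole cache, if the HRSD group happens to be empty).
            hrsd = [v for v in self.last_access if self.group[v] == "HRSD"]
            pool = hrsd or list(self.last_access)
            victim = min(pool, key=self.last_access.get)
        del self.last_access[victim], self.group[victim]

cache = VideoCache(capacity=2, delta=60)
cache.request("news_clip", is_lrsd=True, now=0)
cache.request("old_song", is_lrsd=False, now=10)
cache.request("new_video", is_lrsd=True, now=100)  # evicts stale "news_clip"
print("news_clip" in cache.last_access)  # False
```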


One major problem we have not addressed is how to categorise a video into one of these groups on its first request. While statistically there seems to be no solution to this problem, we can use contextual information obtained from other sources. For example, if it is a news video, it makes sense to put it into the LRSD group; if it is an old song, put it into the HRSD group. One can come up with a table of such mappings and use it to associate a video with a group; videos can subsequently be shifted between the two groups based on new events. If such contextual information is not available, one possible option is to wait for some time before assigning a group: if there is no subsequent request in that time, assign the video to HRSD, otherwise to LRSD.

## Final Proposed Cache Architecture for IIT Delhi

In this section we summarize the options available for caching video in the IIT Delhi campus network and the final design proposal we have come up with. To arrive at this proposal, we combine the constraints of the IIT Delhi campus network with the results derived from the analysis of the Youtube traces in the previous sections.

### Comparison of Available Options

The various available options for the video caching system are:

1. Using existing Squid proxy servers to cache the videos.
2. Create a separate P2P caching system for the videos, preferably integrated with a browser plugin.
3. Create a centralized video cache/server integrated with a browser plugin.

The Squid proxies cannot be used for caching video content, as Squid has not been designed to cache dynamic content (videos in this case) by default. Moreover, youtube.com specifically implements several 'features' that prevent its flash videos from being effectively distributed by caches [3]. The workarounds reported to work technically break the HTTP standards [3]. Hence it is not advisable to use this approach in the IIT Delhi campus network, nor do we expect to get permission from the authorities to make such changes to the Squid configuration of servers serving the entire campus population.

The idea of using a P2P cache for the videos is quite attractive. This architecture has the usual advantages attributed to a P2P design, including:

1. The resource requirement would be shared by the users.
2. An increase in users would bring in more resources, as opposed to causing a resource crunch.
3. Increased robustness of the system.

But there are many hindrances to, and disadvantages of, using a P2P architecture for the intended video cache:

1. Inter-hostel connections are not possible in the IIT Delhi network, which is the biggest hindrance to implementing any P2P network on campus.
2. Selectively sharing the downloaded videos and nothing else is important for maintaining security and privacy.
3. A browser plugin would be required for making these P2P connections, as users would not be willing to switch from the browser to a standalone application in order to view a video. For the popular web browser Firefox we found only one such addon, "AllPeers" [4], but unfortunately its development has been discontinued and the source code has been made public only for the client side, not the server side [5].

From the above discussion we see that the balance weighs heavily towards the use of a central proxy for video.

### Architecture Proposal and Implementation

With both of the cleanest possible solutions (first, using Squid to cache videos; second, sharing the videos cached by each user in a P2P manner) inapplicable in the IIT Delhi campus network, we are left with the option of a central cache. We have therefore come up with our own system, which is explained in the proposal image. In this architecture, whenever a user wants to play a Youtube video, a browser addon checks whether the video is available in the campus cache. If it is, the video is played directly from the local server, resulting in lower latency and no campus Internet bandwidth usage. If there is a cache miss and the user wants to view the video, the video can be downloaded to the cache and then watched.

As shown in the proposal image, we have implemented the central server and the cache required for this design. In the sample implementation the server is an HTTP server which receives the request for a particular video from the user (the browser plugin in this case). It contacts the cache (a CGI script) to check whether the requested video is available. The Linux filesystem is used as the database for the videos, since the only two things required are a name-based lookup and the last access time (for LRU eviction), both of which the filesystem itself provides through simple commands such as ls or find. If the cache registers a hit for the requested video, it returns the video's location to the server, which forwards it to the user agent. On a miss, the Python script youtube-dl [5] is executed to download the video from Youtube and save it in the cache, after evicting some other video if necessary.

## Future Work

We have identified the following as the future work in this particular project:

1. First of all, we need to complete the work shown as future work in the proposal image, which chiefly involves building the browser plugin to give the user a smooth experience with the video proxy.
2. In this report we have proposed a new cache eviction algorithm. This algorithm requires classifying videos as HRSD or LRSD based on some heuristic. We have proposed one such heuristic in this report; more can be devised, and empirical testing on the Youtube traces is required to find the best among them.
3. Due to the non-availability of proper proxy logs from the IIT Delhi proxy servers, we had to use the traces from the UMass campus network. Whether our assumption of similar video-watching behavior across university campuses holds needs to be tested; for this, a similar analysis of the IIT Delhi proxy logs is required.
4. In this project we have concentrated only on Youtube videos. There is a plethora of video sharing sites on the Internet these days; some analysis and design work is required to cache videos from at least the most popular of these services.

## References

1. Michael Zink, Kyoungwon Suh, Yu Gu, Jim Kurose. Characteristics of YouTube network traffic at a campus network: measurements, models, and implications.
2. Squid Log Format.