
Google Hashcode 2017 – How we managed to solve the proposed problem

March 23, 2017 by Christian Expósito

The HashCode challenge, a team-based programming competition organised by Google in which people from Europe, the Middle East and Africa try to solve a problem proposed by the company, took place during the evening of 23rd February, and Ebury was there to cover such an awesome event.
Our Malaga office was open for those who wanted to join us and have some fun among lines of code, flowcharts, loops, and all the things coding freaks like me enjoy spending time on.

This time, Google asked us to improve the way a video streaming platform delivers its videos to the end users: we were given several cache servers, each accessible from certain endpoints, and we could reduce the load time of a specific video by serving it from one of them. We were provided with a file detailing, for each video, how many end users request it, how long the platform takes to serve it from the data centre (latency), which cache servers each endpoint can access, and the latency to each one (read the complete problem here).


[Figure: server organisation]

The input file has the following structure:

[Figure: example input file]
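Since the original example image may not render here, a miniature input in the format our parser expects might look like this (all values are illustrative, with annotations after the arrows):

```
5 2 4 3 100        <- nº videos, endpoints, requests, caches; cache capacity (MB)
50 50 80 30 110    <- size of each video in MB
1000 3             <- endpoint 0: data centre latency, nº of connected caches
0 100              <- cache 0 reachable with 100 ms latency
2 200              <- cache 2 reachable with 200 ms latency
1 300              <- cache 1 reachable with 300 ms latency
500 0              <- endpoint 1: 500 ms latency, no connected caches
3 0 1500           <- video 3 requested 1500 times from endpoint 0
0 1 1000           <- video 0 requested 1000 times from endpoint 1
4 0 500            <- video 4 requested 500 times from endpoint 0
1 0 1000           <- video 1 requested 1000 times from endpoint 0
```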

At this point, it seemed to be a complex task… so, how many days did Google give us to do it? Well, it wouldn’t be a challenge without a little time pressure – so they decided to give us just 4 hours to solve this problem.

When you have to complete an exercise like this one in a short period of time, and there are no limits on submissions (i.e. you can submit an initial solution and replace it as many times as you want), the best thing you can do is begin with an “easy” but scalable solution – even though you don’t score as many points this way – and then improve it as much as you can. Perhaps you will never find the perfect answer, but you will secure a fistful of points that you would lose if you focused on finding the best solution and ran out of time before submitting anything.

Let’s begin with something easy: Reading the file

Every programmer knows how difficult it is to come face to face with a blank file that needs to be filled with lines upon lines of our beloved programming language, so this time we thought back to the advice given to us when we were mere beginners in the programming world: divide and conquer. To solve this problem, the first thing we have to do is interpret the data, so we begin with the piece of code that reads everything included in the file we are provided with.

def read_dataset(filename):
    with open(filename) as data:
        # First line: nº of videos, endpoints, requests, cache servers and their capacity
        num_videos, num_endpoints, num_requests, num_cache_serv, serv_capacity = map(int, data.readline().strip().split())

        # Second line: the size of each video in MB
        videos = list(map(int, data.readline().strip().split()))

        # One block per endpoint: its data centre latency, then one line per connected cache
        endpoints = []
        for element in range(num_endpoints):
            latency, connected_servs = map(int, data.readline().strip().split())
            endpoint = {'latency': latency, 'caches': [], 'videos_req': []}
            for connection in range(connected_servs):
                server_id, serv_latency = map(int, data.readline().strip().split())
                endpoint['caches'].append({'latency': serv_latency, 'id': server_id})
            endpoints.append(endpoint)

        # One line per request: which video, from which endpoint, how many times
        requests = []
        for request in range(num_requests):
            video_id, endpoint_id, n_requests = map(int, data.readline().strip().split())
            request_dict = {'video_id': video_id, 'endpoint_id': endpoint_id, 'n_requests': n_requests}
            requests.append(request_dict)
            endpoints[endpoint_id]['videos_req'].append(request_dict)

    return {'initial': (num_videos, num_endpoints, num_requests, num_cache_serv, serv_capacity),
            'videos': videos, 'endpoints': endpoints, 'requests': requests}

Nothing special to say here: we read how many videos, endpoints, requests and cache servers we have (and their capacity) for each specific case, and then we iterate over the endpoints and requests, collecting all the information we will need to decide which videos to put on which cache servers.

Ok, now our solution file is no longer empty, and you are being filled with determination with every line of code you write. Let’s move on to the next part.

Classifying videos

Now that we are able to “understand” the content of the input file, we need to use that information to decide how to fill the cache servers to improve the performance of our video streaming platform. We decided to assign a score to each (video, cache server) pair, based on the number of times the video is requested by each endpoint, the time gained when serving it from a cache server rather than the data centre, and the cost of storing the video on that cache (the percentage of cache memory needed to lay up the video):

Time = Nº Requests × (Endpoint Latency − Cache Server Latency)

Video Weight = (Cache Size − Video Size) / Cache Size

Video Score = Time × Video Weight
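Plugging in some hypothetical numbers makes the scoring concrete: say a video is requested 1000 times from an endpoint with 500 ms data centre latency and 100 ms latency to a cache server, and the 30 MB video would go into a 100 MB cache:

```python
# Hypothetical figures, just to illustrate the formulas above
n_requests = 1000
endpoint_latency = 500   # ms from the endpoint to the data centre
cache_latency = 100      # ms from the endpoint to the cache server
cache_size = 100.0       # MB of cache capacity
video_size = 30          # MB taken up by the video

time_saved = n_requests * (endpoint_latency - cache_latency)
weight = (cache_size - video_size) / cache_size
score = time_saved * weight
print(time_saved, weight, score)  # 400000 0.7 280000.0
```

The weight term means that, between two videos saving the same amount of time, the smaller one (which leaves more room in the cache) gets the higher score.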

So, let’s translate this into Python:

def calculate_video_cache_score(data):
    endpoints = data['endpoints']
    cache_size = float(data['initial'][-1])
    for endpoint in endpoints:
        video_cache_reduction = {}
        for v in endpoint['videos_req']:
            for c in endpoint['caches']:
                # One score per (cache, video) pair, keyed as "cacheid-videoid"
                key = str(c['id']) + '-' + str(v['video_id'])
                # Time saved: nº of requests times the latency difference
                value = v['n_requests'] * (endpoint['latency'] - c['latency'])
                # Weight: the fraction of the cache the video leaves free
                weight = (cache_size - data['videos'][v['video_id']]) / cache_size
                video_cache_reduction[key] = value * weight
        endpoint['reduction'] = video_cache_reduction

    return data

We also decided to calculate, for each (cache server, video) pair, the total reduction in load time summed across every endpoint, sorting those results from the highest value to the lowest:

def totalize_scores(data):
    # Sum the time saved by each (cache, video) pair across all endpoints
    total = {}
    for endpoint in data['endpoints']:
        for cache_video_id, video_score in endpoint.get('reduction', {}).items():
            total[cache_video_id] = total.get(cache_video_id, 0) + video_score

    # Sort the pairs from the biggest total gain to the smallest
    data['total_score'] = sorted(total.items(), key=lambda x: -x[1])
    return data

Now we have a data structure where the videos are scored according to the amount of time we can gain back on each endpoint by having them allocated in a certain cache server, and those scores are sorted from the best result to the worst one. The hardest part of this task is done!
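As a quick sanity check, this is what that sorted structure looks like for a few made-up totals (the keys are hypothetical "cacheid-videoid" strings):

```python
# Made-up total gains keyed by "cacheid-videoid"
total = {'0-1': 120.0, '1-0': 300.0, '0-3': 45.5}

# Same sorting idiom as totalize_scores: biggest gain first
total_score = sorted(total.items(), key=lambda x: -x[1])
print(total_score)  # [('1-0', 300.0), ('0-1', 120.0), ('0-3', 45.5)]
```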

Giving a solution

After this point, we only need to fill the caches with the videos that give us the biggest time improvement. Don’t forget that cache servers have limited storage, and we are not allowed to exceed it under any circumstances:

def solution(data):
    from itertools import chain

    # Every cache server starts with its full capacity available
    caches = [data['initial'][4] for _ in range(data['initial'][3])]
    video_length = data['videos']

    solution = {}
    already_cached = set()
    for score in data['total_score']:
        cache, video = map(int, score[0].split('-'))
        # Store the video only if it fits and is not already cached elsewhere
        if caches[cache] >= video_length[video] and video not in already_cached:
            videos_in_cache = solution.get(cache, [])
            videos_in_cache.append(video)
            caches[cache] -= video_length[video]
            solution[cache] = videos_in_cache
            already_cached.add(video)

    # One output line per cache: its id followed by the ids of its videos
    return [' '.join(map(str, chain([k], v))) for k, v in solution.items()]

You will see that we also avoid storing the same video in two caches – we noticed that several endpoints were able to access the same cache servers, so we decided to give it a try… and we achieved a big improvement in our final score! Remember our philosophy: work on an idea, and try to maximise the results with it.

Last but not least: Exporting the results into the output file

We have spent a lot of time thinking about an answer to this exercise and transforming those ideas into code, but we are not finished yet; we still have to prepare the file that Google will review in order to give us our final score.
They asked us to upload a file with the following format:

[Figure: example submission file]
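In case the example image does not render here, a hypothetical submission using three cache servers might look like this: the first line gives the number of caches used, and each following line gives a cache id followed by the ids of the videos stored on it (all ids are illustrative):

```
3
0 2
1 3 1
2 0 4
```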

What a surprise! Did you notice that? Our solution function already returns a list of strings, each containing the id of a cache server we are using followed by the ids of the videos we will store on it. All that remains is to open a file, write the length of the list returned by the solution function (the number of cache servers we will use), and then write each element of the list on its own line. We prepared a script that applies our solution to each of the input files we previously downloaded and generates an output file for each one, so we only have to run a Python command and wait:

if __name__ == '__main__':
    for name in ('kittens', 'me_at_the_zoo', 'videos_worth_spreading', 'trending_today'):
        data = read_dataset(name + '.in')
        data = calculate_video_cache_score(data)
        data = totalize_scores(data)
        result = solution(data)
        with open(name + '.out', 'w') as output:
            # First line: how many caches we use; then one line per cache
            output.write('{}\n'.format(len(result)))
            for cache_schedule in result:
                output.write(cache_schedule + '\n')

And that’s all! I hope you had a great time reading the article and learning with us.

Thanks!

