LinkedIn Interview Question (Glassdoor)

Site Reliability Engineer Interview, Mountain View, CA

You need to distribute a terabyte of data from a single server to 10,000 nodes, and then keep that data up to date. It takes several hours to copy the data to just one server. How would you do this so that it didn't take 20,000 hours to update all the servers? Also, how would you make sure that the file wasn't corrupted during the copy?
Tags: technical

Answers

3

My solution was to arrange all of the servers into a master-slave hierarchy. There are, for example, 3 top-level masters. Each top-level master pulls the data from the single master server and then propagates those files to 3 slave servers below it; those slaves act as masters to 3 slaves below them, and so on. I suggested rdist as the mechanism to keep these files up to date. This was not the answer he was looking for, but he actually liked it better than the real answer. (I never did find out what the real answer was.)
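A minimal sketch of that fan-out, assuming nodes are simply numbered 0..9999 with node 0 as the origin (the 3-way fan-out and the numbering scheme are illustrative, not from the interview):

```python
# Sketch: assign each node a parent in a k-ary distribution tree.
# Node 0 is the origin server; every other node pulls from its parent,
# so no machine ever has to serve more than k downstream copies.

def parent(node_id: int, k: int = 3) -> int:
    """The node this node pulls the data from."""
    return (node_id - 1) // k

def children(node_id: int, k: int, total: int) -> list[int]:
    """The nodes this node is responsible for serving."""
    first = node_id * k + 1
    return [c for c in range(first, first + k) if c < total]

if __name__ == "__main__":
    TOTAL, K = 10_000, 3
    print("node 42 pulls from", parent(42, K), "and serves", children(42, K, TOTAL))
    # Copy time grows with tree depth (~log_k(TOTAL) hops), not with TOTAL.
    depth, n = 0, TOTAL - 1
    while n:
        n, depth = parent(n, K), depth + 1
    print("hops from origin to the last node:", depth)  # 9 for 10,000 nodes, k=3
```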

My answer to making sure the file wasn't corrupted was to run md5sum on the data and copy that checksum over as well, then run md5sum on the copied data and compare the two.
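A small sketch of that integrity check, hashing the file in chunks the way md5sum does (the file paths are placeholders):

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a large file in 1 MiB chunks so it never has to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# The source ships dataset.bin.md5 (output of `md5sum dataset.bin`) next to the
# data; each node re-hashes its local copy and compares. Paths are placeholders.
expected = open("/data/dataset.bin.md5").read().split()[0]
if md5_of_file("/data/dataset.bin") != expected:
    raise RuntimeError("copy is corrupted; re-fetch it")
```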

Interview Candidate on Oct 9, 2013
4

P2P is the first thing that came to my mind. BitTorrent is a good tool for this, and I believe Twitter or Facebook has developed this kind of distribution tool based on the BitTorrent protocol.

Also, I don't believe the whole 1 TB will be read at the same time. We could write a FUSE module that mounts the directory from the central server; when one of the files is read, we copy it and cache it locally.
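A simplified sketch of that read-through caching idea, leaving out the actual FUSE plumbing (the cache and mount paths are assumptions for illustration):

```python
import os
import shutil

CACHE_DIR = "/var/cache/dataset"     # local cache on each node (illustrative path)
REMOTE_DIR = "/mnt/central/dataset"  # central server's export, e.g. mounted read-only

def read_file(relpath: str) -> bytes:
    """Return file contents, fetching from the central server only on first access."""
    local = os.path.join(CACHE_DIR, relpath)
    if not os.path.exists(local):
        os.makedirs(os.path.dirname(local), exist_ok=True)
        # First read: pull the file once from the central copy and cache it.
        shutil.copyfile(os.path.join(REMOTE_DIR, relpath), local)
    with open(local, "rb") as f:
        return f.read()
```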

Anonymous on Nov 14, 2013
2

P2P is the best solution. If P2P is not allowed, we can use a doubling broadcast: server1 -> server2, then there are 2 sources, so server1 -> server3 and server2 -> server4 in parallel, and so on. That takes about (time to transfer one copy) * log2(10000) = 4 * 3.32 * (time to transfer one copy), i.e. roughly 13-14 rounds instead of 10,000. For fault tolerance, add redundancy or retries, or deploy a distributed file system so the file can be served from multiple places. If possible, compress the data to cut the time to transfer one copy.
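A small sketch of that doubling schedule, just to show how the round count works out (node indices stand in for host names; the actual copy mechanism is left out):

```python
import math

def broadcast_rounds(total_nodes: int) -> list[list[tuple[int, int]]]:
    """Plan which node sends to which in each round; the set of sources doubles each round."""
    have = [0]                          # node 0 starts with the data
    need = list(range(1, total_nodes))  # everyone else still needs it
    rounds = []
    while need:
        # Every node that already has the data sends to one node that doesn't.
        pairs = list(zip(have, need[: len(have)]))
        rounds.append(pairs)
        have += [dst for _, dst in pairs]
        need = need[len(pairs):]
    return rounds

plan = broadcast_rounds(10_000)
print(len(plan), "rounds vs. log2(10000) =", math.log2(10_000))  # 14 rounds vs. ~13.3
```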

A central-server solution would be something like NFS, but the NFS server could become a bottleneck.

Anonymous on Mar 6, 2016
0

Use multicast, such as uftp: send the file once and all clients receive it. For integrity, send an MD5 hash along with it so each client can verify the file is valid on arrival.
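uftp provides reliable multicast file transfer; just to illustrate the underlying one-to-many idea, here is a bare UDP multicast send/receive sketch using Python's socket module (the group address, port, and single-datagram payload are simplifications):

```python
import socket
import struct

GROUP, PORT = "239.1.1.1", 5007   # arbitrary multicast group/port for the example

def send(payload: bytes) -> None:
    """Send one datagram that every subscribed node receives."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    s.sendto(payload, (GROUP, PORT))

def receive() -> bytes:
    """Join the multicast group and wait for one datagram."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", PORT))
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return s.recv(65535)
```

A real file transfer would still need chunking, retransmission of lost packets, and pacing on top of this, which is what a tool like uftp adds.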

Anonymous on Aug 19, 2018
0

We can take a checksum of the file when it is known to be correct and compare it with a checksum of the same file taken later. If the checksums match, the file isn't corrupted; otherwise it is.

Someone on Jan 6, 2019
