How a big website works
Large sites receive millions of visitors per day. What technologies do they use?
We will compare the infrastructure of some big sites like Tumblr (10 million visits a day), Facebook (250 million) and a few others to show the choices.
This review is not complete, because sites do not provide details of their infrastructure, but the information I have collected should be sufficient to give an idea of how they can work, and of software for processing huge masses of data. And there will be surprises, especially on databases.
The search engine created its own file system, better suited to a workload made up of billions of read-only queries and to distribution across clusters. GoogleFS, or GFS, is proprietary and was conceived in the early days of the site, around 1998.
Other sites use similar systems available on the market, such as IBM GPFS, a parallel cluster file system also chosen by many companies, or HDFS, the Hadoop Distributed File System, which is part of a software package created by the Apache Foundation to process data in clusters. Adobe, AOL, Facebook, eBay, Hulu, LinkedIn, and even IBM use Hadoop or some of its tools.
One thing you will notice with all these big sites is that none of them uses Oracle! One reason is that they need huge numbers of servers and, since the license fee depends on the number of machines, the cost would be astronomical. In addition, with open source solutions, which they all use, it is easier to make changes, and updates and corrections arrive faster. They primarily use MySQL, the same software that powers your blog, but with additional tools and a good development team to optimize the software and make it very efficient.
The site depends largely upon MySQL, whose operations are accelerated by Memcache, HAProxy and sharding, which is a form of database partitioning.
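To illustrate the partitioning mentioned above, here is a minimal sketch of hash-based sharding. It is not the site's actual scheme: the shard names and the helper functions `shard_for` and `group_by_shard` are hypothetical, and a real deployment would map each name to a MySQL host.

```python
import hashlib

# Hypothetical shard names; in practice each would point to a database server.
SHARDS = ["db0", "db1", "db2", "db3"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Pick a shard deterministically from a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def group_by_shard(keys):
    """Group keys by the shard that would hold them."""
    groups = {}
    for key in keys:
        groups.setdefault(shard_for(key), []).append(key)
    return groups
```

The same key always lands on the same shard, so reads and writes for one user go to a single server. A simple modulo scheme like this one moves most keys when the number of shards changes, which is one reason production systems often use lookup tables or consistent hashing instead.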
But it also makes extensive use of Redis, a key-value store, for the notifications shown on users' dashboards. These notifications have a limited shelf life, which suits Redis; it is also used for other ephemeral functions and even as a cache in place of Memcache (a URL/page pair can be stored as a key-value pair).
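The idea of entries with a limited shelf life can be sketched with a small in-memory store. This is only a stand-in to show the principle, not Redis itself and not its client API; the class `ExpiringStore` and its methods are hypothetical names chosen for this example.

```python
import time


class ExpiringStore:
    """In-memory key-value store whose entries expire after a time-to-live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._data = {}      # key -> (value, expiry time)

    def set(self, key, value, ttl_seconds):
        """Store a value that stays readable for ttl_seconds."""
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        """Return the value, or None if it is missing or has expired."""
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily drop the expired entry
            return None
        return value
```

A dashboard notification would be written with a short time-to-live and simply disappears once it is stale, without any cleanup job on the application side.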
Tumblr uses HBase for its URL shortener, history, statistics and messages.
Although Cassandra was created by Facebook, which has given the project to the Apache Foundation, the company is no longer among its users. Facebook replaced it with HBase.
HBase is used for the messaging system, but much of the storage is done by MySQL with Memcache to speed up operations. There are more programmers working on HBase than on MySQL though its use is more limited, suggesting that it is more difficult to implement.
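The way a cache speeds up a database can be sketched with the common cache-aside pattern. This is an assumption about the general technique, not a description of Facebook's code: the class `CachedReader` is hypothetical, a plain dictionary stands in for Memcache, and `load_from_db` stands in for a MySQL query.

```python
class CachedReader:
    """Read-through helper: look in the cache first, then fall back to the database."""

    def __init__(self, load_from_db):
        self._load = load_from_db  # callable standing in for a database query
        self._cache = {}           # dictionary standing in for Memcache
        self.db_reads = 0          # counts how often the database was hit

    def get(self, key):
        if key in self._cache:
            return self._cache[key]  # cache hit: no database access
        self.db_reads += 1
        value = self._load(key)      # cache miss: query the database
        self._cache[key] = value     # remember the result for next time
        return value

    def invalidate(self, key):
        """Drop a cached entry, for example after the row was updated."""
        self._cache.pop(key, None)
```

Repeated reads of the same key reach the database only once, which is where the speed-up comes from. The price is that the cache must be invalidated when the underlying data changes.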
Haystack is a specialized storage system for photos. It sits halfway between a DBMS and a file system, with an in-memory index that optimizes writing and reading.
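The principle of one large file plus an in-memory index can be sketched as follows. This is a toy illustration under stated assumptions, not Facebook's implementation: the class `BlobStore` is a hypothetical name, and an `io.BytesIO` buffer stands in for a large volume file on disk.

```python
import io


class BlobStore:
    """Append-only blob storage with an in-memory index of offsets."""

    def __init__(self):
        self._volume = io.BytesIO()  # stands in for one large file on disk
        self._index = {}             # photo id -> (offset, length)

    def put(self, photo_id, data: bytes):
        """Append the bytes to the volume and record where they start."""
        offset = self._volume.seek(0, io.SEEK_END)
        self._volume.write(data)
        self._index[photo_id] = (offset, len(data))

    def get(self, photo_id) -> bytes:
        """Read a photo with a single seek, using the in-memory index."""
        offset, length = self._index[photo_id]
        self._volume.seek(offset)
        return self._volume.read(length)
```

Because the index lives in memory, finding a photo costs no disk lookup, and reading it takes one seek and one read instead of walking a directory tree of small files.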
Google uses BigTable, which partially inspired Cassandra and HBase, for its main storage. In fact, BigTable is even offered to users on App Engine. BigTable works in clusters running on thousands of machines, and has no performance problems with large amounts of data.
Among the sites that use Hadoop, many resort to HBase for storing masses of data. It is similar to Google's BigTable. HBase is better suited to storing mass data with periodic updates, while Cassandra is better suited to workloads with continuous updates. Both are non-relational (NoSQL) stores.
For these large sites, the most used languages are Scala (designed, as its name indicates, to scale) and PHP. Scala is compatible with Java and can use its APIs.
Dailymotion and Facebook
Both sites were developed in PHP and have retained it despite its relative slowness. Dailymotion uses the Symfony framework. Keeping PHP is understandable there, because the processing time on the server is negligible compared to the time needed to transfer videos.
Facebook did not want to rewrite its whole system in another language. The rewrite would not be a problem in itself, but if the new code had bugs, which is inevitable, they would affect millions of users. So it chose instead to accelerate PHP by creating a compiler to binary code, as well as a virtual machine for development.
Like many others, the site moved to Scala. It started in PHP, while others started in Ruby, but when it comes to handling millions of requests per second, the Java virtual machine is more efficient.
Finagle is an RPC (Remote Procedure Call) tool, that is, a way for the client to query the server, and one that is suitable for a large number of users. It is written in Scala, runs on the JVM and works with any communication protocol.
It was created by Twitter and is also used by Tumblr and others.
Kafka is an open source distributed messaging system for internal use, created by LinkedIn. It is used by Tumblr to store messages.
Thrift is a framework that lets services written in different programming languages work together, and an Apache project. It is mainly used by Facebook, Tumblr and probably many others.
Scribe is a log aggregation system created by Facebook. It was used by Tumblr but quickly abandoned because it could not support the load.
This article describes the software. Going further and describing how these tools interact is more complicated and depends on the activity of the site. Also be aware that using them is not just a matter of taking them out of the box. Each one must first be tested on a particular service before it is integrated into the system and exposed to the mass of users. Even with MySQL, easy use behind a CMS and large-scale use with all the optimization tools are two different things.
Do not let the fear of a future migration stop you from using the simplest tools when a project starts. They are necessary for a good start. You will not know how to use tools made for heavy loads until the site has been tested by a wide audience.
A website is never finished.