#WebApps #Serverless #Docker #Concurrency #Python #WebDevelopment
So, you’re wondering how web apps work at scale, particularly when it comes to handling a large number of concurrent users. It’s a common question, and one that can seem overwhelming at first. But fear not, we’re here to break it down for you.
Let’s start with the basics and work our way up from there.
##Understanding Scalability in Web Apps
When it comes to scaling a web app to handle a high volume of users, several key factors come into play:
1. Server Infrastructure
– The server that hosts your web app needs to have sufficient resources to handle the load.
– Serverless solutions, such as AWS Lambda or Google Cloud Functions, automatically scale based on demand.
2. Load Balancing
– Load balancers distribute incoming traffic across multiple instances of your web app, ensuring no single instance is overwhelmed.
3. Containers and Docker
– Docker containers can be used to package your web app and its dependencies, making it easier to deploy and scale.
##Concurrency and Threading
When it comes to handling concurrent users, you may be wondering how many copies of your web app need to be running.
1. Concurrency in Code
– Depending on the framework or language you’re using, you may need to specify how your web app handles concurrent requests.
– Threading can be used to enable parallel processing and improve performance.
2. Global Interpreter Lock (GIL)
– In the case of Python, the GIL can impact the ability to run multiple threads simultaneously.
– Understanding how the GIL works is crucial when building process-intensive web apps.
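The GIL's effect on CPU-bound work can be sketched with the standard library: the same pure-Python task gains little from threads (they take turns holding the GIL) but speeds up with separate processes, which each get their own interpreter. The workload below is a hypothetical stand-in, not any real app's job.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_task(n: int) -> int:
    # Pure-Python arithmetic: holds the GIL for the whole loop.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, jobs: int, n: int) -> float:
    # Run `jobs` copies of cpu_task under the given executor and time it.
    start = time.perf_counter()
    with executor_cls(max_workers=jobs) as ex:
        list(ex.map(cpu_task, [n] * jobs))
    return time.perf_counter() - start

if __name__ == "__main__":
    # On a multi-core machine, the thread version runs the four jobs
    # essentially one at a time (GIL), while the process version runs
    # them in parallel and usually finishes several times faster.
    print(f"threads:   {timed(ThreadPoolExecutor, 4, 2_000_000):.2f}s")
    print(f"processes: {timed(ProcessPoolExecutor, 4, 2_000_000):.2f}s")
```

This is also why CPU-heavy Python services usually scale with worker *processes* (multiprocessing, gunicorn workers, Celery) rather than threads.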
##Real-World Examples
To illustrate how web apps work at scale, let’s consider a few common examples:
1. File Conversion Web App
– A web app that converts file types (e.g., PDF to Word) needs to handle multiple users uploading and converting files simultaneously.
– The backend infrastructure needs to be able to handle the processing load efficiently.
2. Image Processing Web App
– An AI art generation web app, such as turning a photo into a painting, requires significant CPU/GPU resources for image processing.
– Scaling the app to handle concurrent users submitting images is a complex task.
##Optimizing Performance and Costs
In your scenario, where a Python web app is CPU/GPU intensive and requires 2 minutes to process an input, optimizing performance and managing costs becomes critical.
1. Cost Considerations
– Running a process-intensive web app at scale can incur significant costs, especially in a serverless environment.
– Understanding the cost implications of scaling your app is essential, especially for a portfolio project.
2. Infrastructure Planning
– Planning for scalability and resource allocation is essential to ensure your web app can handle a large number of concurrent users.
– Utilizing serverless technologies can help automate scaling based on demand and reduce infrastructure management overhead.
In conclusion, understanding how web apps work at scale involves a combination of server infrastructure, load balancing, concurrency management, and cost optimization. As you continue to explore and learn about web app development, diving into real-world examples and practical demonstrations can further enhance your understanding of scalability.
We hope this breakdown sheds light on the intricacies of web app scalability and provides a solid starting point for your exploration. Feel free to explore additional resources and tutorials to deepen your knowledge further.
*Keep learning, keep experimenting, and most importantly, enjoy the journey of web app development!* 🚀
Instead of having one server processing your requests, you have multiple. A load balancer takes all the incoming web requests and routes each one to a machine that's low on requests.
4 cores doesn’t mean you can only run 4 threads by the way. You can experiment with running more threads and see how it affects performance.
Also, this kind of stuff is when programming language performance comes into play. Python is very, very slow for pure-Python CPU work. Java is roughly 2-3x slower than C, and Python is roughly 20-30x slower than Java. So if you can only handle 4-5 CPU-intensive requests in Python, it's possible you could handle 100-200 in C/Java.
As a bonus, just processing the HTTP traffic is cheap, performance-wise. Your Python web server can easily handle 100 concurrent requests. What you really need to scale is the CPU-intensive jobs you're running. You can go ahead and just scale to having 10 machines running your Python web server, but even more ideally, you could have a single web server that pushes each request to a pub/sub system like Kafka, then have multiple instances that are just Kafka workers, each fetching a request from Kafka and processing it.
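That web-tier/worker split can be sketched with the standard library. Here `queue.Queue` stands in for Kafka and threads stand in for separate worker machines; the names (`web_server_handle`, the `.upper()` "job") are illustrative, not a real framework API.

```python
import queue
import threading

job_queue: "queue.Queue[dict]" = queue.Queue()
results: dict = {}

def web_server_handle(request_id: int, payload: str) -> None:
    # The cheap part: accept the request and enqueue the work.
    job_queue.put({"id": request_id, "payload": payload})

def worker() -> None:
    # The expensive part: each worker pulls jobs and does the heavy step.
    while True:
        job = job_queue.get()
        if job is None:          # sentinel: shut the worker down
            break
        # Stand-in for the real 2-minute computation:
        results[job["id"]] = job["payload"].upper()
        job_queue.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for i in range(10):
    web_server_handle(i, f"request-{i}")

job_queue.join()                 # wait until every job is processed
for _ in workers:
    job_queue.put(None)          # stop the workers
```

The key property is that the web tier only ever does a fast `put`; you scale the worker pool independently of the web servers.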
https://www.baeldung.com/cs/scaling-horizontally-vertically
Is this CPU bound or I/O bound? Why does it take 1-2 minutes? That answer will help inform how you'd run this in production, i.e. how you'd scale it out.
What is your 2 minute process _doing_?
I’m not saying there’s no valid case where a web server would take that long to respond, but it’s exceedingly rare.
Learning about scaling and all that is important and I don’t want to distract from that…but my gut instinct in your case is to ask if you’re sure that whatever it is you’re doing _needs_ to take 2 minutes, or if you just have horribly inefficient code?
In 99% of real world scenarios, if I find that I have an endpoint that takes even 5 seconds to respond, my first thought isn’t what I can do to scale that service. My first thought is why in the world is whatever it is taking an entire 5 seconds, and how can I fix that?
In OP's case, where is his endpoint located? Just curious, or does anyone in this thread know exactly?
re: “endpoint takes 2 minutes to take an input and spit out an output”
if this is a proper web app function, then you might look into using something like a WebSocket. Your client can submit the request from the front-end and hold open a WebSocket, and the server can send the response back over the socket to be displayed (as per your app logic) once received. There are some older methods to simulate this as well, such as HTTP long-polling, but iirc WebSockets are one of the primary methods in use these days. It's not terribly hard to set up either. Note that this is mostly relevant to web apps that you are interacting with in the web browser.
if you mean to interact with this from an API endpoint (so no front-end involvement), then you would likely need to use a “Submission” style method where your API client makes a POST request to the endpoint, passing the input data, and the server would accept the data, start the long running background task, and respond with an identifier that labels the task. Then the API client, having recorded the task ID, could just repeatedly GET request the server about the status of the task using the ID it was previously given. Eventually when the task is completed the server would record the task as Complete, the API would detect this from GET request, and then could take further action as necessary to e.g. request files or data from the result.
hope that makes sense. You would definitely want to get a database involved here, for the latter case especially, to facilitate tracking of the long-running tasks. You would not want to run the long task in the same server process that is handling the requests. In Python you would likely want to consider something like a multiprocessing pool, or even better, a Celery worker pool. Flask / Django + Celery + RabbitMQ + PostgreSQL is a very standard stack for this exact type of application.
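The submission-style pattern above can be sketched without any framework: a POST handler that returns a task ID immediately, and a GET handler the client polls. An in-memory dict stands in for the database, a background thread stands in for the Celery worker, and `submit`/`status` are hypothetical names rather than a real API.

```python
import threading
import time
import uuid

# In-memory task store; a real app would use a database (e.g. PostgreSQL).
tasks: dict = {}

def submit(input_data: str) -> str:
    """POST handler: start the long job in the background, return a task ID."""
    task_id = uuid.uuid4().hex
    tasks[task_id] = {"status": "pending", "result": None}

    def run() -> None:
        time.sleep(0.1)  # stand-in for the 2-minute computation
        tasks[task_id] = {"status": "complete", "result": input_data[::-1]}

    threading.Thread(target=run, daemon=True).start()
    return task_id

def status(task_id: str) -> dict:
    """GET handler: the client polls this with the ID it was given."""
    return tasks[task_id]

task_id = submit("hello")
while status(task_id)["status"] != "complete":
    time.sleep(0.02)              # a real API client would poll every few seconds
print(status(task_id)["result"])  # → "olleh"
```

In production the worker runs in a separate process (or machine), which is exactly what Celery gives you; the shape of the client interaction stays the same.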
Speeding up the response to one individual client is a separate thing. If it takes 2 minutes to respond to one client now, it will (very likely) take 2 minutes to respond to one client later BUT if you set it up to scale out, either with serverless or other technologies, then it can spin up 1000 instances in parallel and they can each respond to separate clients and they will each take only 2 minutes.
That is “scale out”. There is also “scale up” where possibly you can run it on a much beefier machine (more memory and cores) and maybe it will take less than 2 minutes. But that depends a lot on why it is taking 2 minutes.
There is also a database involved on most web applications. Databases are generally separate from the application code and have their own ways to scale. Depending on how intensively the database is used by the application, it may be able to handle 1000 concurrent sessions, or maybe it can’t.
2 minutes is hella intensive. Most webapps aren’t going to take that long to do something. Are you doing like some sort of image processing or similar?
Anyway, there are probably several ways to go about it. You can take the compute-intensive stuff and pass it to a server that just handles that. People get a response for the intensive task after however long it takes to work through the jobs from other people. You speed it up by however much you are willing to pay, or however much your clients are willing to pay if you pass the cost on to them. Maybe some stuff doesn't have to be repeated and you can cache the results.
Otherwise stuff like file conversions you can possibly offload the processing to a local JavaScript library and make their computer do it.
Only handle basic UI and DB stuff with the webapp directly. A single decent server should be able to handle thousands of people.
To add on to what others have said, any process that takes longer than a second or two is done in a background job so that the web process isn’t blocked
So to demonstrate one approach using your example
>I think some easy live examples are web apps that convert a file type into an other or those that modify an image into ai art. Maybe someone could explain it using those.
1) User uploads a file to convert into AI art
2) The file is stored somewhere and a “job” is created and added to a queue (could use something like Amazon SQS).
3) You have a pool of workers that read jobs from the queue and perform the AI image generation.
4) When the worker is done, it’ll save the result and notify the user somehow that the process has finished.
If your web app is THAT process intensive, it will be very expensive to scale. However, consider the following as food for thought:
1. Modern operating systems support multi-threading. Even if you have a single-core CPU, it can run multiple processes and threads at the same time. The operating system accomplishes this by constantly switching between processes, meaning each process gets a turn for a very short amount of time to use the CPU. This is very efficient in practice because things like memory access, network access, and disk reads are far slower operations than CPU processing, so while your process is waiting for a memory read, the CPU can be used by a different process.
2. Keeping the above in mind, imagine a single server hosting your application. Every time a new request is made, the web server spins up a new thread to process the request. Even if it's a single-core CPU, the OS can still run multiple threads concurrently. However, since the OS is giving concurrent threads resources turn by turn, all of them run very slowly. It is impossible to "guess" how much concurrent load a server could handle for a given application, so in the real world we use "stress testing" tools that generate concurrent requests just to test how much a server can handle.
3. Once you have figured out that your server can handle say 20 concurrent users before it starts taking over 2 minutes to service requests, you can think about scaling provided that you expect more users than that. You could get a bigger server (vertical scaling), OR get multiple servers along with a load balancer (horizontal scaling)
4. In the old days of the internet, scaling was a big issue. You would rent a server that can handle 20 concurrent users, but what happens if your tool goes viral? Your server gets overloaded and crashes. You panic-rent more servers along with a load balancer, but it takes two days to get your app back online. This was more common than you would think. Services like Amazon Web Services were revolutionary and solved this major pain point. You could have your application hosted on an AWS server that handles 20 concurrent users, but as soon as AWS notices CPU or memory spiking, it would immediately provision more machines and use a load balancer to keep your service alive. This is what most internet apps do today.
5. You have to pay for all this infrastructure of course. If your service is so intensive, you’ll probably be paying a good chunk of money supporting concurrent users. This is why developers can build and host simple tools and just pay for them out of pocket, but tools like video conversion or YouTube downloader that are process intensive will have a free tier and paid tier to pay the bills. I doubt even ads on a popular website could pay for that kind of resource-intensive activity.
6. Scaling in general is a very hard and very deep topic. Horizontal and vertical scaling are just basic concepts. Big tech companies pay a loooot of money to people who can design scalable systems.
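The "stress testing" idea from point 2 can be sketched end-to-end with the standard library: spin up a toy server whose handler sleeps to simulate a slow request, then hit it with concurrent clients and measure latency. Real stress tests use dedicated tools (locust, wrk, ab); this is just the shape of the experiment.

```python
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(0.1)  # stand-in for a slow, resource-bound request
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

def hit(_) -> float:
    # One simulated user: time a single request.
    start = time.perf_counter()
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/") as resp:
        assert resp.status == 200
    return time.perf_counter() - start

# 20 concurrent "users". ThreadingHTTPServer handles each request in its
# own thread, so total wall time stays near one request's latency.
with ThreadPoolExecutor(max_workers=20) as ex:
    latencies = list(ex.map(hit, range(20)))

print(f"max latency: {max(latencies):.2f}s")
server.shutdown()
```

Ramp the number of concurrent users up until latency degrades; that knee in the curve is the capacity number you plan your scaling around.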
Actually made a video on website scalability couple of weeks ago, might be helpful: [https://youtu.be/3nInD2RGb2c](https://youtu.be/3nInD2RGb2c)
Also this video on load balancers might be helpful as well: [https://youtu.be/XSZQescGmco](https://youtu.be/XSZQescGmco)
You should learn about system design. Would help you understand what everyone else is saying. There are pros and cons to every solution and there is no prescriptive solution to everything. Or anything
flask apps run as WSGI apps. Many WSGI hosts, like Apache's mod_wsgi, can run many threads at once – https://modwsgi.readthedocs.io/en/develop/user-guides/processes-and-threading.html
This article has a good explanation
[https://bytebytego.com/courses/system-design-interview/scale-from-zero-to-millions-of-users](https://bytebytego.com/courses/system-design-interview/scale-from-zero-to-millions-of-users)
>How do 100 or 1000 people use it concurrently?
Honestly 1000 concurrent users should be basically nothing in any conversation about “at scale”.
>I have a python web app
>
>an endpoint takes 2 min to take an input and spit out an output
First part. I’ve previously defended the performance of Python (which leans heavily on native code anyway) because performance problems are almost always bad code/solution when I get the profiler out. Profile your process. IO bound or CPU bound? Find out what it’s spending 2 mins on. It’ll be file and/or network IO. Maybe it’s using some inefficient data structures for the problem at hand? E.g. constant CPU cache misses because of linked lists or whatever. Learn and understand the time/space complexity of your solution and whether it could be better. If the language truly is the bottleneck here, reimplement in something native without the GIL etc.
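On "get the profiler out": the stdlib's `cProfile` will show exactly where the two minutes go. A sketch, using a deliberately inefficient function as a hypothetical stand-in for the slow endpoint:

```python
import cProfile
import io
import pstats

def slow_lookup(items: list, queries: list) -> int:
    # O(n) membership test per query; using a set would make each O(1).
    return sum(1 for q in queries if q in items)

items = list(range(2_000))
queries = list(range(0, 4_000, 2))

profiler = cProfile.Profile()
profiler.enable()
slow_lookup(items, queries)
profiler.disable()

# Print the five most expensive entries by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

The report tells you whether the time is going into your own code, a library call, or IO waits, which is what decides between "fix the algorithm", "rewrite the hot path in native code", and "scale out".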
>I would make four copies and be able to use 4 instances concurrently
This is “horizontal” scaling, assuming you mean copies of the server. Load balancing requests between many application servers is standard and can also help. You almost certainly don’t need this if you’re just working on a personal project. The problem is your 2 min process, not server saturation.
>But at what level is that decided in web apps? Do I have to tell it to make 100 copies? Is this where docker comes into play?
It’s beyond the app. You’re venturing into infrastructure and system design. The software is written (or will be), how do we deploy it? Yes, containerisation can help with horizontal scaling. Applications often depend on the wider machine, the “environment”, for things. Containerisation just allows you to package an application and its dependencies into one “unit”. The abstraction allows you to create automation around deployments etc., such as you find in AWS ECS with services and tasks. Again, you probably don’t need this for a personal project, and it can get expensive quite quickly.
>specify concurrency in code like when I have to specify the use of threading in my script, should I enable threading and does it even work on web apps?
Depends on the app. VPS and container runners usually allow (or require) you to specify the computing resources you need, such as (v)CPU and memory. Multithreading works just fine in the right environment, and is quite common. So is single threaded asynchronous code. Ultimately you usually don’t want to use blocking IO.
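On "single threaded asynchronous code": a minimal asyncio sketch of non-blocking IO, where 100 simulated IO waits overlap on one thread. `asyncio.sleep` stands in for a real network or disk wait; with blocking IO the same work would take 100x longer.

```python
import asyncio
import time

async def fetch(i: int) -> str:
    # Non-blocking wait: while this coroutine sleeps, the event loop
    # runs the other 99. A blocking time.sleep here would stall them all.
    await asyncio.sleep(0.1)
    return f"response-{i}"

async def main() -> list:
    # 100 IO-bound "requests" overlap on a single thread.
    return await asyncio.gather(*(fetch(i) for i in range(100)))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

# Total wall time is ~0.1s, not 100 x 0.1s.
print(f"{len(results)} responses in {elapsed:.2f}s")
```

This is why a single-threaded async server can juggle thousands of IO-bound connections, while CPU-bound work still needs extra processes or machines.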
I see a lot of googling in your future 😀
Typically when you have a 2-minute-long process with a web app, you do an asynchronous workflow. The user clicks “Go” and you immediately return a page that says “thanks, we will send you an update/alert when this is ready in a few minutes”.
You then spawn a background thread or worker process somewhere that crunches the data, and when it is done you update the web page or send an email or whatever. This could be done by just creating threads to work on these jobs on the same machine as the webserver, or you could send them off to one or many other machines to do the work and return the data, or use things like Amazon SQS queues that hold the input while a bunch of machines grab from the queue and process as fast as they can.
If I had a process that really did take 2 literal minutes and couldn’t be made faster, it would get put in a processing queue and the user can get it later. You can have a whole different computer or computers processing your queue that doesn’t have to be tied at all to what’s serving the website. And it can space out the jobs so it doesn’t max out CPU. You still do need to process them faster than they come in, at some point.
The thing you’re doing does sound very… expensive.