Revisiting OS

The course that haunted me while I was studying it

Header image source: https://software.intel.com/en-us/articles/optimization-practice-of-deep-learning-inference-deployment-on-intel-processors

It’s been two weeks since I joined a startup, and what a journey it has been so far. I never knew how much there was to learn, how widely the knowledge I already had could be applied, or how much impact it could have. What follows is what I went through in these two weeks and what I got out of it.

OpenCV

I already wrote about it here:

Optimization

What I had were 3 test videos with the following stats:

Video 1

Length: 1 minute 45 seconds

Frames: 3150

Execution Time: 952.17 seconds

Video 2

Length: 5 seconds

Frames: 150

Execution Time: 45.28 seconds

Video 3

Length: 15 seconds

Frames: 450

Execution Time: 104.15 seconds

Initially, we included the frame-writing task in the model’s execution time, but we decided not to, since writing frames is not the actual job of a detection model. Counting only the time taken by the model for detection, and no other processing after it, we got the following results (a small timing sketch follows them):

Video 1

873.82 seconds, down from 952.17 seconds

Video 2

43.49 seconds, down from 45.28 seconds

Video 3

99.6 seconds, down from 104.15 seconds
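Concretely, the measurement now wraps only the detection call. A minimal sketch of the idea, where detect() and write_frame() are just placeholders for the real model and the excluded frame-writing step:

import time

def detect(frame):
    return []           # placeholder for the model's forward pass

def write_frame(frame):
    pass                # frame writing, now excluded from the measurement

frames = [None] * 3     # stand-ins for frames read from the video

detection_time = 0.0
for frame in frames:
    start = time.time()
    detections = detect(frame)
    detection_time += time.time() - start
    write_frame(frame)  # still happens, just not timed

print("model time:", detection_time, "seconds")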

As a next step, we decided to use the UMat object while processing frames, which lets OpenCV offload work to the GPU through OpenCL, and we got some more improvement, as follows (a minimal usage sketch comes after the results):

Video 1

859.28 seconds, down from 873.82 seconds

Video 2

43.30 seconds, down from 43.49 seconds

Video 3

101.93 seconds, up from 99.6 seconds
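For reference, here is a minimal sketch of routing frames through UMat; the file name and the blur step are placeholders, not our actual pipeline:

import cv2

# Open the input video; "input.mp4" is a placeholder path.
cap = cv2.VideoCapture("input.mp4")

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Upload the frame into a UMat so subsequent OpenCV calls can run
    # through OpenCL (on the GPU, if one is available).
    gpu_frame = cv2.UMat(frame)

    # Placeholder processing step; any cv2 call that accepts a UMat works here.
    blurred = cv2.GaussianBlur(gpu_frame, (5, 5), 0)

    # Download the result back to a regular numpy array when needed.
    result = blurred.get()

cap.release()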

We then decided to use threads along with the UMat object, and what we got is as follows (a sketch of this threaded variant follows the results):

Video 1

863.75 seconds, down from 873.82 seconds

Video 2

43.92 seconds, up from 43.49 seconds

Video 3

103.29 seconds, down from 104.15 seconds
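For completeness, a minimal sketch of the threaded variant, assuming a small thread pool that applies the same placeholder processing to each frame; the pool size of 4 is arbitrary:

import cv2
from concurrent.futures import ThreadPoolExecutor

def process(frame):
    # Each worker uploads its frame to a UMat and runs the same
    # placeholder processing as in the sketch above.
    gpu_frame = cv2.UMat(frame)
    blurred = cv2.GaussianBlur(gpu_frame, (5, 5), 0)
    return blurred.get()

# Read all frames first; "input.mp4" is a placeholder path.
cap = cv2.VideoCapture("input.mp4")
frames = []
while True:
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(frame)
cap.release()

# Hand the frames to a small pool of worker threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, frames))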

Thus we concluded that it was better to use the UMat object on its own, without adding threads on top of it.

Deployment

I already wrote about it here:

Scaling

The problem we had was that we were unable to handle more than 2 concurrent requests using simple Python threading, and initially we thought it was because of this:

In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once. This lock is necessary mainly because CPython’s memory management is not thread-safe.
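A quick way to see the GIL in action is to time a CPU-bound function run sequentially and then in two threads; this is only an illustrative sketch, not our inference code:

import time
from threading import Thread

def cpu_bound(n=10_000_000):
    # A pure-Python loop holds the GIL for its entire duration.
    total = 0
    for i in range(n):
        total += i
    return total

start = time.time()
cpu_bound()
cpu_bound()
print("sequential:", round(time.time() - start, 2), "seconds")

start = time.time()
threads = [Thread(target=cpu_bound) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Roughly the same wall time as the sequential run, because only one
# thread can execute Python bytecode at a time under the GIL.
print("threaded:", round(time.time() - start, 2), "seconds")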

One solution was to create a new process on every new request, but the problem was as follows:

If we execute a new process on each request, only around 71 processes can be handled concurrently when each process takes about 113 MB on an 8 GB system (roughly 8000 MB / 113 MB ≈ 71).

We solved it by spawning a new process for only the GPU-bound part of the code inside the running thread, of which there was one per request.
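A minimal sketch of that pattern, where run_gpu_inference() is a hypothetical stand-in for the GPU-bound detection code and the video paths are placeholders:

from threading import Thread
from multiprocessing import Process, Queue

def run_gpu_inference(video_path, out_queue):
    # Stand-in for the GPU-bound work; running it in a separate process
    # keeps it outside the parent interpreter's GIL.
    out_queue.put(f"detections for {video_path}")

def handle_request(video_path):
    # One thread per request; the heavy work is delegated to a child process.
    out_queue = Queue()
    worker = Process(target=run_gpu_inference, args=(video_path, out_queue))
    worker.start()
    result = out_queue.get()
    worker.join()
    print(result)

if __name__ == "__main__":
    # Simulate two concurrent requests, one thread each.
    threads = [Thread(target=handle_request, args=(f"video_{i}.mp4",))
               for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()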

After all this, what we got is as follows:

1 min 45 sec video:

1 request: 7.77 sec || 2 requests: 19.1 sec || 4 requests: 52.3 sec || 8 requests: 173.93 sec

So, it ran fine at 100 concurrent requests, but at 1000 it broke, and we were able to solve that by releasing the memory used for video reading.
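One way to do this in OpenCV is to release the capture object explicitly as soon as a request is done with the video, instead of waiting for garbage collection; a minimal sketch, not necessarily our exact fix:

import cv2

def read_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    try:
        frames = []
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frames.append(frame)
        return frames
    finally:
        # Free the decoder's resources as soon as we are done with the video.
        cap.release()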

After this, we had another problem: we were using too many threads and they were consuming resources, so we decided to do the following:

Make job queues so that resources are utilized efficiently. Along with this, we also had to make sure that a response was provided as soon as a request came in, and that the program didn’t keep waiting once the queue limit was reached, for which we used the following method:

There should be a waiting queue and a processing queue.

It was further optimized to have just one daemon and a waiting queue, but the problem of how many requests could be processed at a time remained. We solved that by running exactly as many daemons as the number of requests we want to process at a time, with only a single waiting queue for the daemons to consume.
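A minimal sketch of that final shape, where process_request() is a hypothetical placeholder for the actual per-request work and the daemon count of 2 is only an example:

import queue
from threading import Thread

NUM_DAEMONS = 2            # as many daemons as requests we want in flight at once
waiting_queue = queue.Queue()

def process_request(job):
    # Placeholder for the actual video-processing work.
    print("processed", job)

def daemon_worker():
    # Each daemon pulls jobs from the shared waiting queue forever.
    while True:
        job = waiting_queue.get()
        try:
            process_request(job)
        finally:
            waiting_queue.task_done()

for _ in range(NUM_DAEMONS):
    Thread(target=daemon_worker, daemon=True).start()

# Incoming requests just enqueue a job and return immediately.
for i in range(5):
    waiting_queue.put(f"request-{i}")

waiting_queue.join()       # wait for the demo jobs to finish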

After all this, what we got is as follows:

For 1 Daemon

Around 2.8–3 seconds per request

For 2 Daemons

2.88–3.1 seconds per request (previously 2.84–3.0)

For 3 Daemons

4.2–6.2 seconds per request (previously 2.88–3.1)

For 4 Daemons

5.2–8.2 seconds per request (previously 4.2–6.2)

For 5 Daemons

6.06–9.3 seconds per request (previously 5.2–8.2)

For 6 Daemons

5.75–10.9 seconds per request (previously 6.06–9.3)

For 7 Daemons

7.47–13.4 seconds per request

For 8 Daemons

6.17–14.9 seconds per request

So, what’s next? Well, maybe something on Redis Queue.