“Our Core Idea Is To Make Existing CCTV Cameras Smart” - Atul Rai, Staqu Technologies

What if shop owners could identify their most profitable areas using just their cameras? What if those cameras could also recognise voices or track foot traffic? A Gurugram startup has made this possible with JARVIS. With a goal to make CCTVs smarter, Atul Rai of Staqu Technologies tells EFY’s Nitisha everything about their innovation.


Atul Rai, Co-founder and CEO, Staqu Technologies

Q. How is Staqu Technologies transforming traditional CCTV cameras using AI for security and business insights?

A. Our core idea is to make existing CCTV cameras smart. Typically, cameras are used for security or safety protocols, but function as dumb devices, just recording footage. Staqu’s goal is to act as the brain behind these cameras. For example, if two people are fighting in front of a camera, the system will automatically detect the situation and send alerts to the relevant stakeholders. The system uses the camera’s IP (internet protocol) address to access the stream. Then it runs it through Staqu’s software product, JARVIS, which performs various types of analysis on the video frames to automate security and safety monitoring. In addition to security, Staqu also utilises these cameras to generate valuable insights, such as footfall count, gender identification, and mapping customers’ dwell times, particularly in the retail sector.

Q. Who exactly is your target audience right now?

A. We serve three primary sectors. First is the government, particularly law enforcement and police departments across 11 Indian states, who use the system for smart city and safe city surveillance, analysing traffic, vehicle number plates, facial recognition for criminals, and crowd control. Second is the manufacturing sector, where companies utilise JARVIS to monitor safety protocols, ensuring compliance with requirements such as wearing safety helmets and jackets, as well as detecting potential accidents. Third is the retail sector, along with co-working spaces and hotels, which use JARVIS to generate insights such as footfall, demographic breakdowns, customer journey paths, and dwell times in different zones of their premises.

Q. How does JARVIS work with CCTV camera feeds to detect incidents?

A. JARVIS consumes the video stream and performs different kinds of analysis. There are over 80 different analytics running on the system, such as violence detection, unusual crowd formation, fire detection, and more. Customers do not need to install any hardware. They simply onboard the camera’s IP address into the JARVIS portal, and the system begins processing the feed in real time. The AI (artificial intelligence) built into JARVIS generates alerts and analytics for stakeholders. In retail settings, as I mentioned earlier, it allows businesses to gain intelligence from physical spaces in the same way websites track user activity.

Q. How is JARVIS different from traditional CCTV surveillance systems?

A. Traditional CCTV systems rely on manual monitoring, where a person sits in front of multiple screens trying to watch live footage. However, it is not feasible to manually monitor thousands of cameras, especially in large cities like Gurugram, which has more than 1400 CCTV cameras. The human attention span is limited to five to seven seconds, making manual surveillance prone to errors and the potential for missed incidents. JARVIS addresses this by utilising AI to continuously monitor all camera feeds in real time. It can detect incidents instantly and generate alerts, which manual systems cannot do. Moreover, traditional systems do not create insights like footfall count, gender breakdown, or engagement metrics. JARVIS, on the other hand, performs these analytics purely through software, acting like an intelligent operating system for the camera.

Q. What kind of AI algorithms does JARVIS use?

A. JARVIS employs a combination of advanced AI algorithms. It utilises convolutional neural networks (CNNs) for detecting objects and people in video frames. It also includes large vision models (LVMs), which are transformer-based models similar in architecture to GPT models but tailored for visual data. There are also pure transformer-based classifiers integrated for advanced categorisation tasks. On the audio side, JARVIS uses models for identifying individuals through voice and detecting specific audio events like shouting or cries for help. These audio analytics are language-independent, meaning they can identify a person even if the spoken language differs from the language in the database. Additionally, JARVIS includes activity recognition models to identify ongoing actions, such as violence or cleaning, which occur over a series of frames rather than a single instance. Staqu has also developed a layer of natural language capability using large language models, allowing users to interact with JARVIS via platforms like WhatsApp by asking questions, such as ‘How many people are inside a store right now?’

Q. What is the coverage capability of your system per camera?

A. It uses a technique called planogram mapping, where different zones in a store, such as shirts, suits, or sarees sections, are marked virtually in the camera feed as regions of interest (ROIs). Within the JARVIS software, each of these areas is defined to enable the system to measure visitor engagement and dwell time within each zone. One camera can support multiple ROIs; in some Indian retail deployments, a single camera covers 35 to 40 ROIs. There is no hard limit on the number of zones. It depends on the user’s needs and the physical layout of the space.
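
The zone-based measurement described above can be illustrated with a minimal sketch. This is not Staqu's implementation: the rectangular zones, the pixel coordinates, the `zone_of` and `dwell_times` helpers, and the sample track are all hypothetical, and a real system would get visitor positions from a person tracker rather than a hand-written list.

```python
from collections import defaultdict

# Hypothetical ROIs: zone name -> (x_min, y_min, x_max, y_max) in pixels.
ZONES = {
    "shirts": (0, 0, 100, 100),
    "sarees": (100, 0, 200, 100),
}

def zone_of(x, y, zones=ZONES):
    """Return the name of the first zone containing the point, or None."""
    for name, (x0, y0, x1, y1) in zones.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return name
    return None

def dwell_times(track, zones=ZONES):
    """Sum the seconds one visitor spends in each zone.

    track is a time-ordered list of (timestamp_seconds, x, y) samples.
    Each interval between two samples is credited to the zone that
    contains the earlier sample; time spent outside every zone is ignored.
    """
    totals = defaultdict(float)
    for (t0, x, y), (t1, _, _) in zip(track, track[1:]):
        zone = zone_of(x, y, zones)
        if zone is not None:
            totals[zone] += t1 - t0
    return dict(totals)

# Illustrative track: 5 s in "shirts", 1 s in "sarees", then outside all zones.
track = [(0.0, 10, 10), (2.0, 50, 50), (5.0, 150, 20), (6.0, 300, 20), (9.0, 150, 20)]
print(dwell_times(track))  # {'shirts': 5.0, 'sarees': 1.0}
```

Real ROIs are usually arbitrary polygons rather than rectangles, so a point-in-polygon test would replace the bounds check in `zone_of`.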

Q. How do you manage and store data?

A. We do not store or monetise any customer data. All data is processed and stored on the customer’s own server or cloud infrastructure. JARVIS is deployed at the customer’s location, and analysis happens entirely within their environment. Staqu follows all relevant data protection protocols, including those mandated by AWS (Amazon Web Services), Microsoft, and Indian cybersecurity laws. The company is also GDPR (General Data Protection Regulation)-compliant and undergoes monthly security audits. The systems are certified for data privacy and security, ensuring that the client fully manages data governance.

Q. What is the pricing model?

A. We charge on a per-camera, per-month basis. The starting cost is around ₹8000 per camera per month, depending on the use cases and features required. It operates on a software-as-a-service (SaaS) model. Users subscribe to the service, onboard their camera’s IP address into the portal, and the system begins processing. It is similar to subscribing to Microsoft Office or other SaaS platforms.

Q. How many enterprise customers do you currently have?

A. We currently serve around 170 enterprise customers. Some prominent brands include Raymond, Starbucks, Embassy Group, Olive, Porsche, Dunkin’ Donuts, and Adani Power.

Q. In which cities or countries are you active?

A. We are active in nine countries. In India, the company operates in tier-2 and tier-3 cities such as Jaunpur and Mirzapur, as well as major metropolitan areas like Delhi, Mumbai, and Gurugram. As I said earlier, law enforcement agencies in 11 Indian states are currently using Staqu’s technology. We are also expanding internationally with plans to open an office in Dubai.

Q. Do you have patents or published research?

A. Yes, we hold two patents and have published 25 research papers. One patent covers the ability to process over 400,000 video frames per second from 14,000 cameras in real time. The second patent is for a re-identification system that can recognise individuals across different camera angles without using facial recognition. Instead, it relies on gait, posture, and body structure to maintain privacy.
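
To put the throughput figure in perspective, the short calculation below divides the quoted totals to get an average frame rate per camera. The two numbers come from the answer above; the assumption that the load is spread evenly across cameras is ours, and the function name is illustrative.

```python
def frames_per_camera(total_frames_per_second, cameras):
    """Average frames per second per camera, assuming an even load."""
    return total_frames_per_second / cameras

rate = frames_per_camera(400_000, 14_000)
print(round(rate, 1))  # 28.6
```

That works out to just under 30 frames per second per camera on average.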

Q. Do you have your own lab facility?

A. Staqu’s headquarters and central research lab are located in Gurugram. We also have offices in Bengaluru, Delhi, Mumbai, and Lucknow. For high-end research labs, Staqu is exploring opportunities in the UK and the US, with a focus on cities such as Cambridge or Oxford. We have also partnered with institutions like IIT (Indian Institute of Technology) Delhi for collaborative research.

Q. How many employees do you have?

A. We have approximately 145 employees. About 70 per cent of the workforce is engaged in engineering and research and development (R&D). Business teams are located in various cities.

Q. What challenges did you face in the early years of building AI adoption?

A. In the early days, particularly around 2020, it was difficult to convince people of AI’s potential. Many clients were sceptical about the technology’s ability to deliver meaningful results without additional hardware. To overcome this, we offered free pilots and demonstrations. However, the launch and widespread adoption of ChatGPT significantly increased public trust in AI, which helped Staqu gain traction.

Q. Coming back to JARVIS, can it perform real-time facial recognition across multiple cameras?

A. Yes, JARVIS is capable of real-time facial recognition across multiple cameras. For example, it is used in Uttar Pradesh jails to track visitors and detect repeated interactions with inmates. This system has already helped law enforcement in incidents like the Patna hospital shooting by identifying individuals quickly through camera feeds.

Q. How accurate is JARVIS in identifying faces, vehicles, or objects?

A. JARVIS achieves 99.7 per cent accuracy for facial recognition on the LMW2 benchmark dataset. For audio recognition tasks using the VoxCeleb dataset, it achieves 98 per cent accuracy. These results are based on rigorous benchmarking against public datasets commonly used by global AI research institutions.

Q. Can JARVIS help identify missing or wanted individuals?

A. At present, JARVIS is not actively used for identifying missing persons. It has been implemented by police forces primarily to identify criminals. However, Staqu is exploring the possibility of expanding into such use cases, provided that privacy and legal compliance can be maintained.

Q. How does JARVIS help law enforcement during large events or rallies?

A. JARVIS has been used successfully in large-scale events like the Ayodhya operations. It helped monitor crowd density, identify fake number plates by cross-checking vehicle details with the government’s Vahan database, and detect criminals using facial recognition. The system also supports reverse facial search, where a photo can be uploaded to search for appearances across camera footage.

Q. Can JARVIS integrate with any camera?

A. Absolutely, JARVIS can be integrated with any standard CCTV camera. There are no special hardware requirements.

Q. Can you elaborate on what kinds of audio events JARVIS can detect?

A. JARVIS can detect various audio events such as gunshots, screams, glass breaking, and calls for help. It utilises voice identification models that operate across languages and can identify speakers even when they switch languages. The audio models are trained on language-independent features, such as phonemes and graphemes.

Q. Does JARVIS support multi-language and multi-modal input?

A. Yes, JARVIS supports language-independent speaker recognition and is currently running bi-modal AI that combines audio and video data. Text data processing has also been enhanced through the use of large language models, enabling natural language interaction with the system.

Q. Does JARVIS support speaker identification or voice-based search?

A. It supports speaker identification. JARVIS can identify an individual speaking near a camera based on a pre-existing voice sample, regardless of the language spoken during identification.
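
The interview does not say how JARVIS matches a voice to a stored sample. One common approach, shown here purely as an assumption, is to turn each recording into a fixed-length embedding vector and compare vectors by cosine similarity. The three-dimensional vectors, the names, and the 0.8 threshold below are made up for illustration; real embeddings come from a trained speaker model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify_speaker(probe, enrolled, threshold=0.8):
    """Return the enrolled name most similar to the probe embedding.

    Returns None when no enrolled voice reaches the (hypothetical) threshold.
    """
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(probe, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical enrolled voice samples, already converted to embeddings.
enrolled = {"person_a": [1.0, 0.0, 0.0], "person_b": [0.0, 1.0, 0.0]}

print(identify_speaker([0.9, 0.1, 0.0], enrolled))  # person_a
print(identify_speaker([0.5, 0.5, 0.7], enrolled))  # None
```

Because the comparison works on embeddings rather than words, the same check applies whatever language is being spoken, provided the embedding model itself captures voice characteristics and not vocabulary.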

Q. Does the software support mobile access?

A. Yes, JARVIS has a dedicated mobile app that allows for remote monitoring and interaction with the system.

Q. What are the hardware or bandwidth requirements?

A. JARVIS requires approximately 1 Mbps (megabit per second) of internet bandwidth per camera for real-time video streaming and analysis. It can operate on standard broadband connections without any specialised infrastructure.
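
As a rough planning aid, the per-camera figure can be scaled to a whole site. The sketch below takes the approximate 1 Mbps per camera quoted above and assumes it is a constant, continuous rate with no headroom; the function names and the 40-camera site are illustrative.

```python
def required_bandwidth_mbps(cameras, mbps_per_camera=1.0):
    """Total uplink needed if every camera streams at the same constant rate."""
    return cameras * mbps_per_camera

def daily_volume_gb(mbps):
    """Data sent in 24 hours at a constant bit rate, in decimal gigabytes."""
    seconds_per_day = 24 * 60 * 60
    megabits = mbps * seconds_per_day
    return megabits / 8 / 1000  # megabits -> megabytes -> gigabytes

print(required_bandwidth_mbps(40))     # 40.0
print(round(daily_volume_gb(1.0), 1))  # 10.8
```

Under these assumptions, a 40-camera site needs about 40 Mbps of sustained uplink, and each camera sends roughly 10.8 GB a day.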

Q. Do you have any future plans or product launches?

A. We plan to expand into the B2C (business-to-consumer) market within the next one to one and a half years. The goal is to democratise security by enabling everyday users to convert their ordinary CCTV cameras into smart devices using JARVIS. Staqu also plans to establish international R&D labs and expand the scope of analytics and natural language features available in JARVIS.

Q. What was your last fiscal performance?

A. In 2022, Staqu had an annual recurring revenue of around ₹25 to ₹30 million. Currently, we are operating at an ARR (annual recurring revenue) of ₹370 to ₹400 million. This represents a tenfold increase over the past four years. We are also six per cent EBITDA (earnings before interest, taxes, depreciation, and amortisation) positive and have never lost a customer since our inception.

Nitisha Dubey
Nitisha Dubey is a journalist at EFY. She focuses on startups and innovations with a deep interest in new technologies and business models.
