29 May 2025

Data Management in SIEM Tools: From Inception to Deletion

SIEM

Are you falling into these common traps?

On the surface, SIEM tools aggregate and analyse data from across an organisation’s digital environment, acting as a centralised location for data insights: most commonly enabling threat detection, supporting incident response and meeting compliance requirements.

But what is the full flow of this data, how do we best manage it and what are the common pitfalls? 

 

1. Onboarding/Inception: The Beginning of the Lifecycle

Inception marks the starting point of the SIEM journey. This is the act of enabling logging and/or forwarding to begin capturing events. The most common data sources include network devices, firewalls, servers, endpoints, IAM tools and cloud environments. 

The common pitfall we see with clients in the inception phase is a product-led approach. This method of inception almost immediately leads to undue cost and a lack of demonstrable value.

Inception should be led by risk and compliance. When risk is considered first, ahead of data and product, it provides a clearer view of the data feeds required to build a robust security posture. Focusing on data and product first can lead both to excessive data ingestion, driving up costs, and to a false sense of security where critical data feeds have been missed. Both of these issues are alleviated by taking a risk-and-framework-first approach, reinforcing robust security coverage and ingesting only the necessary data.

 

2. Data Ingestion and Normalisation 

Once data sources are identified and onboarded, it’s not a given that the data is formatted in a way that can be effectively analysed and queried. Good-quality field extractions, normalisation and CIM modelling are critical.

Commonly, poor-quality extractions, normalisation and modelling lead to incomprehensible logs and failing detection use cases.

Are you actively monitoring and resolving errors in this space? They can arise at any time, most often caused by agent or product updates that silently change data formats.
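As a simple illustration, the sketch below (in Python, with hypothetical field names and no particular vendor's schema) shows the kind of check that can catch this drift early: comparing the fields actually arriving in events against the fields your extractions expect.

```python
# Minimal sketch of monitoring for format drift: check incoming events
# against the field set your extractions expect, and flag anything missing.
# Field names and the alerting step are hypothetical, for illustration only.

EXPECTED_FIELDS = {"timestamp", "src", "dest", "action", "user"}

def check_for_drift(events: list[dict]) -> set[str]:
    """Return expected fields that have stopped appearing, e.g. after an
    agent or product update silently renamed or dropped them."""
    seen = set()
    for event in events:
        seen.update(event.keys())
    return EXPECTED_FIELDS - seen

missing = check_for_drift([
    {"timestamp": "2025-05-29T10:00:00Z", "SourceAddress": "10.0.0.1", "action": "allow"},
])
if missing:
    print(f"Alert: expected fields no longer present: {sorted(missing)}")
```

Run on a regular schedule against a sample of recent events, a check like this turns an "often-unseen" format change into a visible alert rather than a silent gap in detections.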

When data has not been strategically formatted, it has a knock-on effect on both the overall performance of the system and the time required to complete development work on the SIEM. Data that is not mapped to a standardised model is less resilient to change, such as field-name alterations, which can be a regular occurrence with certain log sources. Furthermore, when data has not been formatted to a uniform model, every data analysis and security detection search must be created in a bespoke manner for each data feed, which greatly increases the workload. This can be avoided through the use of a unified data model that allows multiple data sources to be queried with a single search.
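To make the idea concrete, here is a minimal, hypothetical sketch of a unified data model in practice: one mapping layer translates each vendor's field names into a common schema, so a single search shape covers every source. The source names and fields are illustrative, not any particular product's schema.

```python
# Minimal sketch: mapping bespoke vendor fields onto a unified model.
# Source names and field names here are hypothetical.

FIELD_MAPPINGS = {
    "firewall_a": {"src": "src_ip", "dest": "dst", "action": "act"},
    "firewall_b": {"src": "SourceAddress", "dest": "DestAddress", "action": "Outcome"},
}

def normalise(source: str, raw_event: dict) -> dict:
    """Translate a raw event into the unified model, so one search
    (e.g. on 'src' and 'action') covers every onboarded source."""
    mapping = FIELD_MAPPINGS[source]
    return {unified: raw_event.get(raw) for unified, raw in mapping.items()}

# One query shape now works for both firewalls:
event_a = normalise("firewall_a", {"src_ip": "10.0.0.1", "dst": "10.0.0.9", "act": "blocked"})
event_b = normalise("firewall_b", {"SourceAddress": "10.0.0.2", "DestAddress": "10.0.0.9", "Outcome": "allowed"})
blocked = [e for e in (event_a, event_b) if e["action"] == "blocked"]
```

The detection logic is written once against the unified field names; only the mapping layer has to change when a vendor renames a field.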

Read about our approach here.

 

3. Storage and Archiving 

SIEM tools store logged data according to retention policies. Proper storage management is crucial not just for data availability but also for compliance and cost efficiency. Depending on the deployment, customers often distinguish between hot (frequently accessed) and cold (archival) storage.

Hot data is available for immediate query and is what drives detections, use-cases and the searches used for incident response. With most tools, cold data options are available, offering cheaper retention but with varying limitations on how and when it can be queried. 

Organisations often fall into one of two traps: either relying on default retention policies, or taking retention policies led by compliance and applying them at a global level, without consideration for the content. The first will leave you without data when you need it most, whilst the second almost always leads to increased cost.
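A rough, back-of-the-envelope sketch (with entirely illustrative ingest volumes, retention requirements and cost rates) shows why a single global retention period is usually the expensive option:

```python
# Hypothetical comparison: one compliance-driven retention period applied
# globally versus retention set per log source. All figures are illustrative.

daily_ingest_gb = {"firewall": 120, "dns": 300, "audit": 15}        # GB/day
required_retention_days = {"firewall": 90, "dns": 30, "audit": 365}  # per source
cost_per_gb_stored = 0.03                                            # illustrative rate

# Global policy: everything kept as long as the strictest requirement.
global_retention = max(required_retention_days.values())
global_cost = sum(daily_ingest_gb.values()) * global_retention * cost_per_gb_stored

# Tailored policy: each source kept only as long as it actually needs.
tailored_cost = sum(daily_ingest_gb[s] * required_retention_days[s]
                    for s in daily_ingest_gb) * cost_per_gb_stored

print(f"Global {global_retention}-day policy: ~{global_cost:,.0f} per period")
print(f"Per-source policy:          ~{tailored_cost:,.0f} per period")
```

With these made-up numbers the global policy stores roughly six times the data of the tailored one, for no additional compliance or detection value.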

A clear data strategy is needed, with each log type accounted for and its content considered in detail. Are you aware of the content in each log and how that enables compliance? What components of a log contribute to CIM models and DUCs? Is 100% of the log’s content required?

These are all questions that should be asked during a data discovery. They are critical for driving retention policies that deliver what the business needs without the cost.

Recently we worked with a customer and discovered that hundreds of GBs of log content within data used for security was unneeded for both compliance and use cases. We were able to trim down the content of the logs, reducing cost hugely, without any loss of functionality or value.
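The mechanics of that kind of trimming can be as simple as an allow-list of fields applied before indexing. The sketch below uses hypothetical field names purely to show the principle:

```python
# Minimal sketch of trimming log content: keep only the fields that a
# compliance requirement or detection use case actually references.
# Field names and the REQUIRED_FIELDS list are hypothetical.

REQUIRED_FIELDS = {"timestamp", "user", "src", "action", "status"}

def trim(event: dict) -> dict:
    """Drop fields nothing downstream uses, reducing indexed volume
    without losing functional value."""
    return {k: v for k, v in event.items() if k in REQUIRED_FIELDS}

raw = {
    "timestamp": "2025-05-29T10:15:00Z", "user": "jdoe", "src": "10.0.0.1",
    "action": "login", "status": "success",
    "debug_trace": "x" * 2000, "session_blob": "y" * 5000,  # bulky, unused content
}
slim = trim(raw)
saving = 1 - len(str(slim)) / len(str(raw))
print(f"~{saving:.0%} smaller for this event")
```

The hard work is the discovery that builds the allow-list, not the trimming itself.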

Read more here.

 

4. Deletion 

With discovery done and a robust data strategy in place, attention turns to engineering. The deletion of data, and how it rolls between states, is often misconfigured.

In most tools it’s possible to delete or roll data between storage and archive based on size and not time. The consequence? If this wasn’t planned for in your data strategy, you’ll be missing critical data because of an often-unseen configuration error. 
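A quick, hypothetical calculation shows how size-based rollover quietly undercuts a time-based policy:

```python
# Sketch of the size-versus-time pitfall: if data rolls to archive or is
# deleted when an index hits a maximum size, the *effective* retention in
# days depends on ingest volume, not on the stated policy. Figures are
# illustrative only.

max_index_size_gb = 500        # size-based rollover threshold
daily_ingest_gb = 20           # current ingest rate for this source
policy_retention_days = 90     # what the data strategy assumes

effective_retention_days = max_index_size_gb / daily_ingest_gb  # 25 days here
if effective_retention_days < policy_retention_days:
    print(f"Shortfall: data ages out after ~{effective_retention_days:.0f} days, "
          f"not the {policy_retention_days} days the policy expects.")
```

In this made-up example the data is gone after roughly 25 days, and the gap only grows as ingest volumes rise, which is why size-based limits need to be reviewed whenever data volumes change.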

 

5. How is pipelining changing the game?

Pipelining tools such as Cribl are empowering organisations across the full flow of data management, providing options at every stage.

Read some more on pipelining here. 
