
Log Processing with Fluent-Bit


In cyber security, effective log management is crucial for analyzing incidents, hunting for threats, and identifying anomalies. This post shows how to set up a complete log ingestion pipeline using Fluent Bit on a Debian system, collecting logs from various sources and shipping them to OpenSearch for analysis and visualization. With the logs ingested into OpenSearch it is possible to build dashboards, apply threat detection rules, and hunt for threats. This makes it a solid foundation for every defensive cyber security homelab.

What is Fluent Bit?

Fluent Bit is a lightweight, high-performance log processor and forwarder that enables you to collect data and logs from different sources, process them, and send them to multiple destinations. Unlike its big brother Fluentd, Fluent Bit is designed to be:

  • Lightweight: Uses minimal system resources (memory footprint under 650KB)
  • Fast: Written in C with performance in mind
  • Flexible: Supports multiple input sources and output destinations
  • Cloud-native: Perfect for containerized environments and edge computing

It picks up data from various locations (inputs), processes and organizes it (parsers and filters), and delivers it to the right destinations (outputs).

Architecture Overview

Here you can see the different components of Fluent Bit and how the log processing pipeline works:

[Diagram: log source → input → parser → filter → output]

Data Flow

  1. Collection: Fluent Bit reads logs from various sources (systemd journal, files, containers)
  2. Processing: Raw log data is parsed into structured fields and filtered
  3. Routing: Processed logs are tagged and routed to appropriate outputs
  4. Delivery: Logs are sent to OpenSearch with proper indexing and formatting
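
To make these four stages concrete, here is a minimal, self-contained configuration you can run directly. It uses the built-in dummy input and stdout output, so no external log source or destination is required:

# demo.conf - smallest possible pipeline: input -> filter -> output
[SERVICE]
    Flush     1
    Log_Level info

# 1. Collection: emit one synthetic JSON record per second
[INPUT]
    Name   dummy
    Tag    demo.flow
    Dummy  {"message": "hello pipeline"}

# 2./3. Processing and routing: enrich every record tagged demo.*
[FILTER]
    Name   modify
    Match  demo.*
    Add    stage processed

# 4. Delivery: print the enriched records to the terminal
[OUTPUT]
    Name   stdout
    Match  demo.*

Running fluent-bit -c demo.conf prints one enriched record per second, showing the added stage field.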

Installation and Setup

The installation of Fluent Bit is straightforward and can be automated with tools like Ansible. Once the repository is set up, software updates are delivered through the regular apt package manager workflow.

Installing Fluent Bit on Debian

# Add Fluent Bit GPG key
curl https://packages.fluentbit.io/fluentbit.key | gpg --dearmor > /usr/share/keyrings/fluentbit-keyring.gpg

# Add Fluent Bit repository
echo "deb [signed-by=/usr/share/keyrings/fluentbit-keyring.gpg] https://packages.fluentbit.io/debian/bullseye bullseye main" > /etc/apt/sources.list.d/fluent-bit.list

# Update package list and install
apt update
apt install fluent-bit

# Enable and start the service
systemctl enable fluent-bit
systemctl start fluent-bit
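
After the service is started, a quick sanity check confirms that the package is installed and the service is healthy:

# Check the installed version
fluent-bit --version

# Verify the service is active and inspect its recent log output
systemctl status fluent-bit
journalctl -u fluent-bit --no-pager -n 20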

Directory Structure

The Fluent Bit configuration follows a modular approach and is set up as you can see below:

/etc/fluent-bit/
├── fluent-bit.conf          # Main configuration file
├── parsers.conf             # Parser definitions
├── inputs/                  # Input configurations
│   ├── systemd.conf
│   ├── docker.conf
│   └── ...
├── filters/                 # Filter configurations
│   ├── parsers.conf
│   ├── common.conf
│   └── ...
├── outputs/                 # Output configurations
│   ├── opensearch-system.conf
│   ├── opensearch-auth.conf
│   └── ...
└── parsers/                 # Detailed parser definitions
    ├── ssh.conf
    ├── ufw.conf
    └── ...

Simpler setups without multiple files and directories are also possible, but keeping the configuration modular and structured makes modifications easier in the future.

Understanding the Pipeline

Main Configuration

The main configuration file (fluent-bit.conf) orchestrates the entire pipeline:

# ===========================================
# Fluent-Bit Main Configuration
# ===========================================

[SERVICE]
    Flush        5              # Flush data every 5 seconds
    Daemon       Off            # Run in foreground for systemd
    Log_Level    info           # Logging level
    Parsers_File /etc/fluent-bit/parsers.conf
    storage.path /var/lib/fluent-bit/

# Include modular configurations
@INCLUDE inputs/systemd.conf
@INCLUDE inputs/docker.conf
@INCLUDE filters/parsers.conf
@INCLUDE filters/common.conf
@INCLUDE outputs/opensearch-system.conf

The Four Main Components of Fluent Bit

Component | Think of it as           | Purpose
Input     | “Where logs enter”       | Collects raw log data from sources
Parser    | “How text becomes data”  | Converts unstructured text into structured fields
Filter    | “How data is modified”   | Enriches, filters, or transforms the data
Output    | “Where logs end up”      | Sends processed logs to destinations

Input Configuration

Systemd Journal Collection

The systemd input is perfect for collecting system logs on modern Linux distributions:

# System logs (journald, cron, kernel)
[INPUT]
    Name              systemd
    Tag               journal.system
    Systemd_Filter    _SYSTEMD_UNIT=systemd-journald.service
    Systemd_Filter    _SYSTEMD_UNIT=cron.service
    Systemd_Filter    _TRANSPORT=kernel      # Kernel messages arrive via the kernel transport, not a unit
    Strip_Underscores On        # Clean up field names
    Read_From_Tail    On        # Start from end of log

# Authentication logs
[INPUT]
    Name              systemd
    Tag               journal.auth
    Systemd_Filter    SYSLOG_FACILITY=4      # Auth facility
    Systemd_Filter    SYSLOG_FACILITY=10     # Auth private facility
    Systemd_Filter    _SYSTEMD_UNIT=sshd.service
    Systemd_Filter    _COMM=sudo
    Strip_Underscores On
    Read_From_Tail    On

Key Configuration Options Explained:

  • Tag: Labels the data stream for routing in filters and outputs
  • Systemd_Filter: Filters journal entries based on systemd fields
  • Strip_Underscores: Removes leading underscores from field names for cleaner output
  • Read_From_Tail: Starts reading from the end of logs (important for existing systems)
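
To find useful values for Systemd_Filter, you can inspect the raw journal directly: journalctl can dump entries as JSON and list all values a field takes (note that on Debian the SSH unit may be called ssh.service rather than sshd.service):

# Show all fields of the latest SSH journal entry (keys map 1:1 to Systemd_Filter)
journalctl -u sshd.service -o json-pretty -n 1

# List every value the _SYSTEMD_UNIT field currently holds in the journal
journalctl -F _SYSTEMD_UNIT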

Docker Container Logs

# Docker writes one JSON-encoded log file per container
[INPUT]
    Name               tail
    Tag                docker.*
    Path               /var/lib/docker/containers/*/*-json.log
    Parser             docker
    Docker_Mode        On
    Docker_Mode_Flush  5
    Docker_Mode_Parser container_firstline

This input tails the JSON log files Docker writes for each running container (Fluent Bit's docker input plugin only collects container metrics, so tail with Docker_Mode is the way to collect logs). Docker_Mode re-joins long log lines that Docker splits across multiple JSON entries.
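
The docker parser referenced above decodes Docker's JSON log format. A definition along the lines of the one shipped in Fluent Bit's default parsers.conf looks like this:

[PARSER]
    Name         docker
    Format       json
    Time_Key     time
    Time_Format  %Y-%m-%dT%H:%M:%S.%L
    Time_Keep    On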

Log Parsing

Parsing transforms unstructured log lines into structured data. This is one of the most important steps!

To have normalized field names across all indices, it makes sense to follow a naming standard such as Elastic's ECS (Elastic Common Schema):

https://www.elastic.co/docs/reference/ecs/ecs-field-reference

SSH Log Parser Example

SSH logs contain valuable security information, but they’re in text format. Here’s how we parse a failed SSH login:

Raw Log Line:

Failed password for invalid user admin from 192.168.1.100 port 22 ssh2

A great tool to write and test regular expressions is regex101.com:

[Screenshot: testing the failed SSH login regex on regex101.com]

Parser Configuration:

# Failed password for invalid user admin from 192.168.1.100 port 22 ssh2
[PARSER]
    Name        ssh_failed_invalid_user
    Format      regex
    Regex       /^Failed password for invalid user (?<ssh_user>[^ ]+) from (?<src_ip>[0-9.]+) port (?<src_port>\d+) (?<protocol>\S+)$/

Parsed Output (JSON):

{
    "ssh_user": "admin",
    "src_ip": "192.168.1.100",
    "src_port": "22",
    "protocol": "ssh2",
    "message": "Failed password for invalid user admin from 192.168.1.100 port 22 ssh2",
    "timestamp": "2025-11-13T10:30:45.123Z"
}
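
Following the ECS recommendation from above, the parsed fields could be renamed to their ECS equivalents with a modify filter. This is only a sketch; the rest of this post keeps the shorter field names:

[FILTER]
    Name    modify
    Match   journal.auth
    Rename  ssh_user  user.name      # ECS: user involved in the event
    Rename  src_ip    source.ip      # ECS: source address
    Rename  src_port  source.port    # ECS: source port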

Understanding Regex Parsers

Let’s break down the regex pattern:

/^Failed password for invalid user (?<ssh_user>[^ ]+) from (?<src_ip>[0-9.]+) port (?<src_port>\d+) (?<protocol>\S+)$/
  • ^Failed password for invalid user - Literal text match
  • (?<ssh_user>[^ ]+) - Named group capturing the username (anything except spaces)
  • from - Literal text
  • (?<src_ip>[0-9.]+) - Named group capturing IP address
  • port - Literal text
  • (?<src_port>\d+) - Named group capturing the port number
  • (?<protocol>\S+)$ - After a literal space, a named group capturing the protocol (non-whitespace characters) until the end of the line

Multiple Parser Strategy

Complex logs often require multiple parsers. SSH logs have many different message formats:

# Successful login
[PARSER]
    Name        ssh_accepted_password
    Format      regex
    Regex       /^Accepted password for (?<ssh_user>[^ ]+) from (?<src_ip>[0-9.]+) port (?<src_port>\d+) (?<protocol>\S+)$/

# Too many failed attempts
[PARSER]
    Name        ssh_too_many_failures
    Format      regex
    Regex       /^Disconnecting authenticating user (?<ssh_user>[^ ]+) (?<src_ip>[0-9.]+) port (?<src_port>\d+): Too many authentication failures \[(?<auth_stage>[^\]]+)\]$/

# Session management
[PARSER]
    Name        ssh_session_opened
    Format      regex
    Regex       /^pam_unix\(sshd:session\): session opened for user (?<ssh_user>[^(]+)\(uid=(?<uid>\d+)\) by \(uid=(?<by_uid>\d+)\)$/

Filtering and Processing

Filters modify and enrich log data as it flows through the pipeline.

Parser Filter Application

The parser filter applies our regex parsers to incoming log data:

[FILTER]
    Name                parser
    Match               journal.auth
    Key_Name            MESSAGE        # Field containing the log message
    Parser              ssh_failed_invalid_user
    Parser              ssh_failed_password
    Parser              ssh_accepted_password
    Parser              ssh_session_opened
    Parser              ssh_session_closed
    Preserve_Key        On             # Keep original message
    Reserve_Data        On             # Keep unparsed fields

How Multi-Parser Filtering Works:

  1. Fluent Bit tries the first parser (ssh_failed_invalid_user)
  2. If it doesn’t match, tries the second parser (ssh_failed_password)
  3. Continues until a parser matches or all parsers are exhausted
  4. If no parser matches, the log passes through unchanged

Tag Rewriting for Routing

Tags determine where logs go. We can modify tags based on parsed content:

[FILTER]
    Name                rewrite_tag
    Match               journal.auth
    Rule                $ssh_user ^(root|admin|administrator)$ auth.privileged_user false
    Rule                $src_ip ^192\.168\. auth.internal false
    Rule                $src_ip ^10\. auth.internal false
    Rule                $MESSAGE Failed auth.failed false
    Rule                $MESSAGE Accepted auth.success false
    Emitter_Name        auth_rewriter

This creates new tags based on content:

  • auth.privileged_user for root/admin login attempts
  • auth.internal for internal IP addresses
  • auth.failed for failed authentication attempts
  • auth.success for successful logins
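
Dedicated outputs can then subscribe to these rewritten tags. As a sketch, failed logins could be shipped to their own daily index (connection settings as in the OpenSearch section below, credentials supplied via environment variables):

[OUTPUT]
    Name                opensearch
    Match               auth.failed
    Host                192.168.4.33
    Port                9200
    HTTP_User           ${OPENSEARCH_USER}
    HTTP_Passwd         ${OPENSEARCH_PASSWORD}
    Logstash_Format     On
    Logstash_Prefix     logs-auth-failed
    Suppress_Type_Name  On
    tls                 On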

Data Enrichment

Add contextual information to logs:

[FILTER]
    Name                modify
    Match               auth.*
    Add                 log_type security
    Add                 source_system ${HOSTNAME}
    Add                 environment production

This adds metadata fields to every authentication log, making them easier to search and analyze.

OpenSearch Integration

OpenSearch (the open-source fork of Elasticsearch) receives our processed logs for storage and analysis.

Output Configuration

[OUTPUT]
    Name                opensearch
    Match               auth.*
    Host                192.168.4.33
    Port                9200
    HTTP_User           logstash
    HTTP_Passwd         P@$$w0rD123456  # Example only - prefer environment variables (see Best Practices)
    Index               logs-auth
    Type                _doc
    Logstash_Format     On              # Use Logstash naming convention
    Logstash_Prefix     logs-auth       # Index prefix
    Logstash_DateFormat %Y.%m.%d        # Date format for index names
    Generate_ID         On              # Generate unique document IDs
    Retry_Limit         5               # Retry failed sends
    Suppress_Type_Name  On              # Modern OpenSearch compatibility
    tls                 On              # Use encryption
    tls.verify          Off             # Skip certificate verification (for self-signed)

Index Strategy

Our configuration creates daily indices:

  • logs-auth-2025.01.15 for authentication logs on January 15, 2025
  • logs-system-2025.01.15 for system logs on the same day
  • logs-firewall-2025.01.15 for firewall logs

This strategy provides:

  • Easy retention management: Delete old indices easily
  • Performance optimization: Smaller, time-based indices query faster
  • Logical organization: Related logs grouped together
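
Retention then boils down to deleting indices by name; a single API call removes one day of logs, and OpenSearch's Index State Management can automate this on a schedule:

# Drop a daily index once it falls outside the retention window
curl -u user:password -X DELETE "https://opensearch-host:9200/logs-auth-2025.01.15"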

Document Structure in OpenSearch

Here’s what a processed SSH log looks like in OpenSearch:

{
  "@timestamp": "2025-01-15T10:30:45.123Z",
  "ssh_user": "admin",
  "src_ip": "192.168.1.100",
  "src_port": "22",
  "protocol": "ssh2",
  "log_type": "security",
  "source_system": "web-server-01",
  "environment": "production",
  "message": "Failed password for invalid user admin from 192.168.1.100 port 22 ssh2",
  "systemd_unit": "sshd.service",
  "priority": "6",
  "facility": "4"
}
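
With documents structured like this, aggregations become straightforward. As an illustrative query (assuming default dynamic mappings, where text fields get a .keyword sub-field), this counts failed logins per source IP over the last 24 hours:

# Failed logins per source IP, last 24 hours
curl -u user:password -X GET "https://opensearch-host:9200/logs-auth-*/_search" \
  -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "Failed" } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  },
  "aggs": {
    "by_source_ip": { "terms": { "field": "src_ip.keyword", "size": 10 } }
  }
}'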

Comparison with Other Tools

Fluent Bit vs. Elastic Agent

Aspect         | Fluent Bit                           | Elastic Agent
Resource Usage | Very low (~650KB memory)             | Higher (50-200MB memory)
Flexibility    | High - supports many outputs         | Medium - Elastic Stack focused
Configuration  | Text-based, version controllable     | GUI-based (Fleet) + YAML
Parsing        | Powerful regex and built-in parsers  | Good with Elastic Common Schema
Learning Curve | Moderate                             | Easier with Elastic Stack knowledge
Vendor Lock-in | None - open source                   | Some (optimized for Elastic)
Updates        | Manual configuration changes         | Centralized Fleet management

When to Choose Fluent Bit

Choose Fluent Bit when you need:

  • Minimal resource usage (edge computing, IoT)
  • Multi-destination log shipping
  • Custom parsing logic
  • Non-Elastic destinations (OpenSearch, Kafka, etc.)
  • Fine-grained control over log processing

Choose Elastic Agent when you want:

  • Seamless Elastic Stack integration
  • Centralized management via Fleet
  • Pre-built integrations for popular services
  • Easier initial setup for Elastic-only environments

Fluent Bit vs. Other Alternatives

Filebeat:

  • Lighter than Elastic Agent but heavier than Fluent Bit
  • Less flexible parsing capabilities
  • Better for simple log forwarding

Logstash:

  • Much heavier resource usage
  • More powerful filtering and processing
  • Better for complex transformations

Vector:

  • Similar performance to Fluent Bit
  • Rust-based with strong type safety
  • Good alternative with growing ecosystem

Best Practices

1. Resource Management

# Configure buffering to handle traffic spikes
[SERVICE]
    storage.path /var/lib/fluent-bit/
    storage.sync normal
    storage.checksum off
    storage.backlog.mem_limit 5M
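
The service-level storage path only takes effect for inputs that opt in. A sketch of enabling disk-backed buffering on the systemd input from earlier:

[INPUT]
    Name          systemd
    Tag           journal.system
    storage.type  filesystem   # Buffer chunks on disk so they survive restarts

# For memory-buffered inputs, cap memory usage instead:
#     Mem_Buf_Limit 10M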

2. Error Handling

# Add retry logic and parallel delivery workers
[OUTPUT]
    Name                opensearch
    Match               *
    Retry_Limit         5
    Workers             2
    # ... other config ...

3. Security Considerations

# Use environment variables for sensitive data
[OUTPUT]
    Name                opensearch
    HTTP_User           ${OPENSEARCH_USER}
    HTTP_Passwd         ${OPENSEARCH_PASSWORD}
    tls                 On
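
Under systemd, one way to provide these variables is an EnvironmentFile loaded through a drop-in unit (a sketch; the file paths are examples):

# /etc/systemd/system/fluent-bit.service.d/override.conf
[Service]
EnvironmentFile=/etc/fluent-bit/fluent-bit.env

# /etc/fluent-bit/fluent-bit.env (chmod 600, owned by root)
OPENSEARCH_USER=logstash
OPENSEARCH_PASSWORD=change-me

# Apply the change
systemctl daemon-reload && systemctl restart fluent-bit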

4. Monitoring and Observability

# Enable metrics endpoint
[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

Access metrics at http://your-server:2020/api/v1/metrics/prometheus
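
A quick way to check delivery health is to grep the output-side counters; rising error or retry counts point at problems between Fluent Bit and OpenSearch (exact metric names can vary slightly between versions):

curl -s http://localhost:2020/api/v1/metrics/prometheus | grep fluentbit_output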

5. Testing Configurations

# Test configuration syntax
fluent-bit --config /etc/fluent-bit/fluent-bit.conf --dry-run

# Run with verbose logging for debugging
fluent-bit --config /etc/fluent-bit/fluent-bit.conf --log-level debug

6. Performance Optimization

  • Use appropriate flush intervals (5-30 seconds)
  • Configure worker threads for outputs
  • Monitor memory usage and adjust buffers
  • Use storage buffering for reliability

Troubleshooting Common Issues

Parser Not Matching

# Test regex patterns (using the harness sketched below)
echo '{"log": "Failed password for invalid user admin from 192.168.1.1 port 22 ssh2"}' | \
    fluent-bit --config test-parser.conf --log-level debug
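
A sketch of what test-parser.conf could look like: the stdin input reads the JSON-wrapped line, the parser filter applies a single parser to its log field, and stdout shows whether the fields were extracted:

# test-parser.conf - minimal parser test harness
[SERVICE]
    Flush        1
    Parsers_File /etc/fluent-bit/parsers/ssh.conf

[INPUT]
    Name   stdin
    Tag    test

[FILTER]
    Name     parser
    Match    test
    Key_Name log
    Parser   ssh_failed_invalid_user

[OUTPUT]
    Name   stdout
    Match  test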

OpenSearch Connection Issues

# Test connectivity
curl -u user:password https://opensearch-host:9200/_cluster/health

# Check Fluent Bit logs
journalctl -u fluent-bit -f

High Memory Usage

  • Review buffer settings
  • Check for parsing loops
  • Monitor input rates vs. output capacity
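
For the last point, the storage layer exposes its own statistics once storage.metrics is enabled in the SERVICE section; comparing chunk counts over time shows whether inputs are outpacing outputs:

[SERVICE]
    HTTP_Server      On
    HTTP_Port        2020
    storage.metrics  on    # Expose buffer/chunk statistics via the HTTP server

# Inspect chunk usage (ingestion rate vs. delivery capacity)
curl -s http://localhost:2020/api/v1/storage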

Conclusion

Fluent Bit provides a powerful, efficient solution for log collection and processing. Its lightweight nature makes it perfect for modern infrastructures while maintaining the flexibility needed for complex log processing requirements.

Key takeaways:

  • Modular configuration makes maintenance easier
  • Powerful parsing transforms unstructured logs into searchable data
  • Flexible routing allows sophisticated log distribution
  • OpenSearch integration provides excellent analytics capabilities
  • Lower resource usage compared to alternatives

Start with basic configurations and gradually add complexity as your needs grow. The investment in proper log parsing pays dividends in operational visibility and incident response capabilities.

Lars Ursprung