Retention policy in Elasticsearch 6.x on AWS: deleting a document by query

Sooner or later, every system administrator has to define a retention policy for the data stored in a time series database in order to keep disk space consumption under control.

This article explains how to implement a retention policy that deletes documents older than a given age from Elasticsearch 6.x (hosted by Amazon Web Services in this case) using the Elasticsearch 6.x API.

The technology stack used in this article:

Elasticsearch 6.0, hosted by AWS
Python 3.6.5

The method we are going to use is called delete_by_query and has the following syntax:

POST twitter/_delete_by_query

{
  "query": { 
    "match": {
      "message": "some message"
    }
  }
}

The above query shows how to delete the documents whose “message” field matches the given text.

To implement our retention policy, we will delete all the documents created earlier than a certain date, and we will do so on a regular basis with a Python script. For this we will use the “range” query combined with “lte”, the “less than or equal to” operator.

This time the query will look like:

POST twitter/_delete_by_query

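{
  "query": {
    "range": {
      "timestamp": {
          "lte": "2018-06-07T23:59:59.999"
      }
    }
  }
}

Doing so, we will delete all the stored documents whose timestamp is earlier than June 8th 2018. The date is written in ISO 8601 format, which the default date mapping accepts; if the timestamp field uses a custom format, the value has to follow that format instead.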
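The same request body can be built in Python from a computed cutoff date. The following is a minimal sketch, not part of the original setup: the helper names end_of_day_cutoff and build_range_query are hypothetical, and it assumes the field is called timestamp and accepts ISO 8601 values with millisecond precision.

```python
from datetime import datetime, timedelta


def end_of_day_cutoff(now, retention_days):
    """Return the last millisecond of the day that lies retention_days before now."""
    day = now - timedelta(days=retention_days)
    return day.replace(hour=23, minute=59, second=59, microsecond=999000)


def build_range_query(cutoff, field='timestamp'):
    """Build a range query matching documents at or before the cutoff.

    Assumption: the field accepts ISO 8601 strings such as
    2018-06-07T23:59:59.999 (the default date mapping does).
    """
    return {
        'query': {
            'range': {
                field: {
                    'lte': cutoff.isoformat(timespec='milliseconds')
                }
            }
        }
    }


# Example: a 7-day retention evaluated on June 14th 2018.
cutoff = end_of_day_cutoff(datetime(2018, 6, 14, 10, 30), retention_days=7)
body = build_range_query(cutoff)
```

Passing the current time in as an argument, rather than calling datetime.now() inside the helper, keeps the cutoff calculation easy to check with fixed dates.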

To do this systematically, we can embed the deletion query in a scheduled Python script running, for instance, every night. The following Python script, using the Elasticsearch Python client, deletes all the documents older than seven days that have been recorded in test-index with “tweet” as doc_type.

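from datetime import datetime, timedelta

from elasticsearch import Elasticsearch

# Keep the last 7 days of data; anything older is deleted.
retention_days = 7

# datetime.now() is local time; use datetime.utcnow() if timestamps are stored in UTC.
offset = datetime.now() - timedelta(days=retention_days)

# Move the cutoff to the very end of that day (23:59:59.999999).
retention_policy = offset.replace(hour=23, minute=59, second=59, microsecond=999999)

# Replace 'ES_EndPoint' with the endpoint of your Elasticsearch domain.
es = Elasticsearch(['ES_EndPoint'])

# Match every document whose timestamp is at or before the cutoff.
# The client serializes the datetime to an ISO 8601 string.
body = {
  'query': {
    'range': {
      'timestamp': {
        'lte': retention_policy
      }
    }
  }
}

response = es.delete_by_query(
  index='test-index',
  body=body,
  doc_type='tweet'
)
print(response)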
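Since a deletion cannot be undone, it can be useful to check how many documents the query matches before removing them. The sketch below is one possible way to do that and is not part of the original script: purge_expired and its dry_run flag are hypothetical names, and it assumes a client object that exposes count and delete_by_query accepting the same index and body arguments, as the Elasticsearch Python client does.

```python
def purge_expired(es, index, body, dry_run=True):
    """Count or delete the documents matched by the retention query.

    es      -- an Elasticsearch client (or any object with the same methods)
    index   -- name of the index to clean up
    body    -- the range query used for the retention policy
    dry_run -- when True, only count the matching documents

    Returns the number of matching (dry run) or deleted documents.
    """
    if dry_run:
        # Same query, but nothing is removed.
        return es.count(index=index, body=body)['count']
    return es.delete_by_query(index=index, body=body)['deleted']
```

A nightly job could call purge_expired(es, 'test-index', body) first, log the result, and only then call it again with dry_run=False.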

This is the state of the AWS Elasticsearch dashboard before running the deletion script:

AWS ES Dashboard before deletion

Running the code deletes the only 3 available documents. If everything went fine, we get the following response:

{
  'took': 7,
  'timed_out': False,
  'total': 3,
  'deleted': 3,
  'batches': 1,
  'version_conflicts': 0,
  'noops': 0,
  'retries': {'bulk': 0, 'search': 0},
  'throttled_millis': 0,
  'requests_per_second': -1.0,
  'throttled_until_millis': 0,
  'failures': []
}
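In a scheduled script nobody is watching the output, so it may be worth checking the response before considering the run successful. The following sketch only relies on the fields shown in the response above; check_delete_response is a hypothetical helper name, and the choice to raise an exception on any problem is an assumption about how the job should fail.

```python
def check_delete_response(response):
    """Return the number of deleted documents, or raise if the run had problems."""
    problems = []
    if response.get('timed_out'):
        problems.append('request timed out')
    if response.get('version_conflicts', 0) > 0:
        problems.append('%d version conflicts' % response['version_conflicts'])
    if response.get('failures'):
        problems.append('%d failures' % len(response['failures']))
    if response.get('deleted', 0) != response.get('total', 0):
        problems.append('deleted %s of %s matching documents'
                        % (response.get('deleted'), response.get('total')))
    if problems:
        raise RuntimeError('delete_by_query incomplete: ' + '; '.join(problems))
    return response['deleted']


# The response shown above passes the check.
sample = {
    'took': 7, 'timed_out': False, 'total': 3, 'deleted': 3, 'batches': 1,
    'version_conflicts': 0, 'noops': 0, 'retries': {'bulk': 0, 'search': 0},
    'throttled_millis': 0, 'requests_per_second': -1.0,
    'throttled_until_millis': 0, 'failures': []
}
deleted = check_delete_response(sample)
```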

In the end, the AWS dashboard shows the following final state:

AWS ES Dashboard after deletion