Sooner or later, every system administrator has to define a retention policy for the data stored in a time series database in order to keep disk usage on the server under control.
This article explains how to set up a retention policy that deletes documents older than a given age from Elasticsearch 6.x (hosted on Amazon Web Services in this case) using the Elasticsearch 6.x API.
The technology stack used in this article:
Python 3.6.5
The API we are going to use is called delete_by_query, and it has the following syntax:
POST twitter/_delete_by_query
{ "query": { "match": { "message": "some message" } } }
The query above deletes every document whose “message” field matches the given text.
To achieve our desired retention policy, we will delete all the documents created before a certain date, and we will do it on a regular basis with a Python script. To do so, we will use the “range” keyword combined with “lte”, the “less than or equal to” operator.
This time the query looks like this:
POST twitter/_delete_by_query
{ "query": { "range": { "timestamp": { "lte": "2018-06-07T23:59:59.999" } } } }
This deletes every stored document whose timestamp is earlier than June 8th 2018. Note that the date is written in ISO 8601 format, which is what a date field with the default mapping expects.
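As a minimal sketch, the same request body can be assembled in Python from a calendar day. The helper names end_of_day and build_delete_query are hypothetical (they are not part of any library), and the sketch assumes that the “timestamp” field is mapped as a date accepting ISO 8601 strings.

```python
from datetime import date, datetime, time


def end_of_day(day):
    """Return the last millisecond of the given calendar day."""
    return datetime.combine(day, time(23, 59, 59, 999000))


def build_delete_query(day, field="timestamp"):
    """Build a _delete_by_query body matching documents up to the end of `day`.

    `field` is assumed to be a date field that accepts ISO 8601 strings.
    """
    cutoff = end_of_day(day).isoformat(timespec="milliseconds")
    return {"query": {"range": {field: {"lte": cutoff}}}}


body = build_delete_query(date(2018, 6, 7))
print(body)
# {'query': {'range': {'timestamp': {'lte': '2018-06-07T23:59:59.999'}}}}
```

Building the cutoff from a date rather than typing it by hand keeps the format consistent between runs.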
To do this systematically, we can embed the deletion query in a scheduled Python script that runs, for instance, every night. The following Python script, which uses the Elasticsearch Python client, deletes all the documents older than seven days that have been recorded in test-index with “tweet” as doc_type.
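from datetime import datetime, timedelta

from elasticsearch import Elasticsearch

# Number of days of data to keep
retention_days = 7

# End of the day `retention_days` days ago: everything up to this instant is deleted
offset = datetime.now() - timedelta(days=retention_days)
retention_policy = offset.replace(hour=23, minute=59, second=59, microsecond=999999)

# Replace 'ES_EndPoint' with the endpoint of your Elasticsearch domain
es = Elasticsearch(['ES_EndPoint'])

body = {
    'query': {
        'range': {
            'timestamp': {
                'lte': retention_policy
            }
        }
    }
}

# Delete every matching document of type "tweet" stored in test-index
es.delete_by_query(
    index='test-index',
    body=body,
    doc_type='tweet'
)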
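Since a deletion cannot be undone, it can be useful to check first how many documents the query would match. The following is only a sketch: count_expired is a hypothetical helper, and it assumes that es is an Elasticsearch client instance like the one created in the script and that body is the same range query.

```python
def count_expired(es, index, body, doc_type=None):
    """Return how many documents the deletion query would match.

    `es` is assumed to be an elasticsearch.Elasticsearch client (6.x),
    whose count() call returns a dict containing a 'count' key.
    """
    result = es.count(index=index, doc_type=doc_type, body=body)
    return result['count']


# Possible usage with the objects defined in the deletion script:
# expired = count_expired(es, 'test-index', body, doc_type='tweet')
# print('Documents that would be deleted:', expired)
```

If the number is unexpectedly high, the script can stop before calling delete_by_query.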
This is the state shown in the AWS Elasticsearch dashboard before running the deletion script:

Running the script deletes the only 3 available documents and, if everything went fine, returns the following response:
{ 'took': 7, 'timed_out': False, 'total': 3, 'deleted': 3, 'batches': 1, 'version_conflicts': 0, 'noops': 0, 'retries': {'bulk': 0, 'search': 0}, 'throttled_millis': 0, 'requests_per_second': -1.0, 'throttled_until_millis': 0, 'failures': [] }
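For a script that runs unattended every night, it may help to verify this response instead of reading it by eye. The sketch below is one possible check, not an official recipe: check_delete_response is a hypothetical helper that relies only on the keys visible in the response above.

```python
def check_delete_response(response):
    """Return the number of deleted documents, or raise if the run was incomplete."""
    problems = []
    if response.get('timed_out'):
        problems.append('request timed out')
    if response.get('version_conflicts', 0) > 0:
        problems.append('%d version conflicts' % response['version_conflicts'])
    if response.get('failures'):
        problems.append('%d failures' % len(response['failures']))
    if problems:
        raise RuntimeError('delete_by_query incomplete: ' + ', '.join(problems))
    return response.get('deleted', 0)


# The response shown in this article
response = {
    'took': 7, 'timed_out': False, 'total': 3, 'deleted': 3, 'batches': 1,
    'version_conflicts': 0, 'noops': 0, 'retries': {'bulk': 0, 'search': 0},
    'throttled_millis': 0, 'requests_per_second': -1.0,
    'throttled_until_millis': 0, 'failures': []
}
print(check_delete_response(response))  # 3
```

Raising an exception makes a failed run visible to whatever scheduler launches the script.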
In the end AWS will show the following final state:
