In one of our recent Meteor applications we included a full document search feature using Elasticsearch. Elasticsearch creates an index of documents based on their metadata and plain-text content. For this feature we also needed to support PDF and office file types (doc, docx, pptx, etc.). To accommodate this, Elasticsearch has a plugin called elasticsearch-mapper-attachments.
Because we wanted to run Elasticsearch from a Docker image, we decided to extend the elasticsearch:2.3.3 image and add the plugin on top of it. The plugin takes care of transforming the documents into plain text using Apache Tika. That plain text is then used to create a document in Elasticsearch.
My colleague Bryan pushed it to Docker Hub for anyone to use under the tag bryantebeek/elasticsearch-mapper-attachments:2.3.3. We can now provision our server with this Docker image using Ansible, and configure the volumes so that the indexes created by Elasticsearch are persisted on the Docker host's disk.
The Elasticsearch service is now running and ready to accept connections from Node application code. Before we can index any documents, we have to create the index itself. We use the elasticsearch npm module to set up the connection to Elasticsearch.
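A minimal sketch of such a connection, assuming the legacy `elasticsearch` npm package is installed and the container is reachable on `localhost:9200`. The host, the log level and the helper names `clientOptions` and `createClient` are assumptions for illustration, not our exact setup:

```javascript
// Sketch only: host, log level and helper names are assumptions.
function clientOptions(host) {
  return {
    host: host || 'localhost:9200', // assumed default address of the container
    log: 'error',                   // only log errors
  };
}

function createClient(host) {
  // Required lazily so that clientOptions() can be used without the package.
  const elasticsearch = require('elasticsearch');
  return new elasticsearch.Client(clientOptions(host));
}
```

Calling `createClient()` once at startup and reusing the returned client for every request avoids opening a new connection pool per call.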
Now we can create an index. We call it “files” and set the file property to type “attachment” to trigger the use of the mapper plugin.
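Creating the index could look roughly like this. It is a sketch under assumptions: the mapping type name `document` and the helper names are made up for illustration, and `client` is expected to be an instance of the elasticsearch npm client:

```javascript
// Sketch only: the type name "document" is an assumption.
function filesIndexParams() {
  return {
    index: 'files',
    body: {
      mappings: {
        document: {
          properties: {
            // "attachment" hands the base64 content to the mapper plugin
            file: { type: 'attachment' },
          },
        },
      },
    },
  };
}

// `client` is an instance of the elasticsearch npm client.
function createFilesIndex(client) {
  return client.indices.create(filesIndexParams());
}
```

Keeping the parameters in their own function makes the mapping easy to inspect without a running Elasticsearch instance.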
Whenever a document is uploaded in the application, we read it into memory, encode it as base64 and use the same elasticsearch client to create a new entry in the “files” index.
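As an illustration, the upload step might be sketched as follows. The type name `document`, the helper names and the use of a file path as input are assumptions; in a real upload handler the buffer may come from the request instead of the disk:

```javascript
const fs = require('fs');

// Encode the raw file contents as base64, as the attachment type expects.
function toBase64(buffer) {
  return buffer.toString('base64');
}

// Build the parameters for client.index(); "document" is an assumed type name.
function fileDocument(buffer) {
  return {
    index: 'files',
    type: 'document',
    body: { file: toBase64(buffer) },
  };
}

// Read the file into memory and hand it to the elasticsearch client.
function indexFile(client, path) {
  return client.index(fileDocument(fs.readFileSync(path)));
}
```

Because the whole file is held in memory and grows by about a third when base64-encoded, very large uploads may need a size limit.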
The document is now added to Elasticsearch and ready to be returned by a search query. When the user uses the search functionality, a query is sent through the Elasticsearch client and the results are returned to the front-end.
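A search along these lines could be sketched as below. The choice of a simple `match` query on `file.content` (the extracted text of the attachment) is an assumption, as are the helper names; the actual query in an application may well be more elaborate:

```javascript
// Sketch only: a plain match query on the extracted text is an assumption.
function searchParams(term) {
  return {
    index: 'files',
    body: {
      query: {
        match: { 'file.content': term },
      },
    },
  };
}

// `client` is an instance of the elasticsearch npm client.
function searchFiles(client, term) {
  return client.search(searchParams(term));
}
```

With the real client, `searchFiles(client, 'invoice')` returns a promise that resolves to the search response.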
The hits object in the result contains an array of hits sorted by search score, which can then be rendered however you please!
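To make the shape of the result concrete, here is a small sketch that reduces a search response to the fields a front-end might need. The helper name `summarizeHits` and the selection of fields are assumptions; `hits.hits`, `_id` and `_score` are part of the standard Elasticsearch search response:

```javascript
// Reduce a search response to a flat list of { id, score } objects.
// The order of the hits (highest score first) is preserved.
function summarizeHits(result) {
  return result.hits.hits.map(hit => ({
    id: hit._id,
    score: hit._score,
  }));
}
```

The returned array can be passed straight to a template or sent to the client as JSON.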