Monday, 5 February 2018

IRIS-H (alpha): Added ZIP file support

Quick Summary

Build Version: 0.0.1(alpha)
Change Type: new feature
Affected Components: API, UI
Short Description: API-side logic has been added to allow submitting zipped files. The industry-standard password 'infected' is supported. The UI-side 'Submission' and 'About' pages have been updated to reflect the changes.
Outstanding Tasks: None
Known Issues: ZIP files created with Ubuntu 'Archive Manager' throw an error.

Detailed Summary

Logic has been added to IRIS-H to handle file extraction from ZIP archives. The 'Submission' page now accepts ZIP file uploads and performs the following operations on each one:

  • identify if the file is a Microsoft Office document in OOXML format
  • identify the number of files in the archive
  • identify if the password is set
  • identify the unpacked size of the compressed file contained in the archive
  • identify if the archive file is 'nested'
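The checks above can be illustrated with a minimal Python sketch. This is not the service's actual code (the real extraction runs in C#); the function name `inspect_zip` and the returned keys are hypothetical, and both the OOXML test (looking for the `[Content_Types].xml` part) and the nesting test (looking at the file extension) are simple heuristics chosen for illustration.

```python
import io
import zipfile


def inspect_zip(data: bytes) -> dict:
    """Collect the archive properties listed above from raw ZIP bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # Directory entries are not counted as files.
        files = [info for info in archive.infolist() if not info.is_dir()]
        names = [info.filename for info in files]
        return {
            # OOXML packages carry this part at the archive root.
            "is_ooxml": "[Content_Types].xml" in names,
            "file_count": len(files),
            # Bit 0 of the general-purpose flags marks an encrypted entry.
            "encrypted": any(info.flag_bits & 0x1 for info in files),
            # Sizes are read from the archive headers; nothing is unpacked.
            "unpacked_size": sum(info.file_size for info in files),
            # Heuristic (assumption): judge nesting by extension only.
            "nested": any(name.lower().endswith(".zip") for name in names),
        }
```

Because only the archive headers are read, the sketch works on password-protected archives without knowing the password.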

The following restrictions and limitations are applied:

  • ZIP file must contain a single file
  • if a ZIP file password is set, it must be 'infected'
  • unpacked size of the compressed file contained in the archive must not exceed 10 MB
  • ZIP 'nesting' must not exceed 2 levels (ZIP-in-a-ZIP)
  • ZIP file size must not exceed 4 MB*
* The 4 MB ZIP file size limit is imposed by the underlying technology used to handle the file extraction. More on this in the following section.
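To make the restrictions concrete, here is a rough Python sketch of how they could be enforced. It is an illustration under assumptions, not the IRIS-H implementation: `check_zip` and the constant names are hypothetical, "MB" is taken to mean 2**20 bytes, an OOXML document is treated as a plain file rather than as an archive, and Python's `zipfile` can only decrypt the legacy ZipCrypto scheme.

```python
import io
import zipfile
from typing import List

MB = 1024 * 1024                 # assumption: "MB" means 2**20 bytes
MAX_ZIP_BYTES = 4 * MB
MAX_UNPACKED_BYTES = 10 * MB
MAX_DEPTH = 2                    # ZIP-in-a-ZIP is the deepest allowed
PASSWORD = b"infected"


def check_zip(data: bytes, depth: int = 1) -> List[str]:
    """Return the violated restrictions; an empty list means acceptable."""
    if depth == 1 and len(data) > MAX_ZIP_BYTES:
        return ["ZIP file exceeds 4 MB"]
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        files = [i for i in archive.infolist() if not i.is_dir()]
        # Assumption: an OOXML document counts as a file, not an archive.
        if any(i.filename == "[Content_Types].xml" for i in files):
            return []
        if depth > MAX_DEPTH:
            return ["nesting exceeds 2 levels"]
        if len(files) != 1:
            return ["archive must contain a single file"]
        entry = files[0]
        if entry.file_size > MAX_UNPACKED_BYTES:
            return ["unpacked size exceeds 10 MB"]
        try:
            # The password is ignored for unencrypted entries.
            payload = archive.read(entry, pwd=PASSWORD)
        except RuntimeError:
            return ["password is not 'infected'"]
    # Recurse when the extracted file is itself a ZIP archive.
    if zipfile.is_zipfile(io.BytesIO(payload)):
        return check_zip(payload, depth + 1)
    return []
```

A caller would pass the uploaded bytes to `check_zip` and reject the submission if the returned list is non-empty. Corrupt archives raise `zipfile.BadZipFile`, which this sketch leaves to the caller.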

What's under the hood?

Disclaimer: The choice of technology for implementing ZIP file support was driven mainly by a desire to learn it. Another contributing factor, though, is the lack of good Node.js libraries that handle password-protected ZIP files.

The IRIS-H API and UI components are written in different flavours of JavaScript. Originally, I was looking to implement ZIP file support using a JS library, but to my surprise I couldn't find one with proper support for the different compression and encryption types. I realized it would have to be implemented in a different programming language, but integrating that with the rest of the service and its infrastructure seemed challenging until I decided to look into using AWS Lambda.

AWS Lambda supports a number of programming languages, including C# with .NET Core 2.0, which opens up a good number of possible solutions. I settled on SharpZipLib, a library that supports most of the compression and encryption methods. Building an AWS Lambda function turned out to be a rather easy task. The most challenging part was dealing with the size limit that the 'Invoke' function enforces on 'RequestResponse' payloads. The only solution I could find was to apply a ZIP file size limit at submission time. It is currently set to 4 MB because of Lambda's payload limit of 6 MB; the 2 MB difference leaves room for the 'base64' encoding that the submitted ZIP file undergoes when it is sent to the Lambda function.
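The size arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constant names are mine, "MB" is assumed to mean 2**20 bytes, and whatever the request wraps around the base64 string is not counted.

```python
import base64

MB = 1024 * 1024                    # assumption: "MB" means 2**20 bytes
LAMBDA_PAYLOAD_LIMIT = 6 * MB       # synchronous invocation payload limit
ZIP_LIMIT = 4 * MB                  # limit applied at submission time


def base64_size(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    # Every 3 input bytes become 4 output characters, rounded up.
    return 4 * ((raw_bytes + 2) // 3)


encoded = base64_size(ZIP_LIMIT)
headroom = LAMBDA_PAYLOAD_LIMIT - encoded

print(encoded)    # 5592408 bytes, about 5.33 MB
print(headroom)   # 699048 bytes left for the rest of the request
```

So a 4 MB ZIP grows by about a third once encoded, and the remaining part of the 2 MB margin is what is left for the rest of the request body.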

Testing with ZIP files of different sizes shows that a 4 MB ZIP file takes about 10 seconds on average to process. Files under 1 MB are processed with almost no delay.

Like the rest of the service, this new feature is experimental and requires more thorough testing. I'd appreciate any feedback.