Processing huge files using FileReader.readAsArrayBuffer() in web browser

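This article is a repost, collected solely for my own knowledge management. If it infringes any rights, please contact suziwen1@gmail.com and it will be removed immediately.
Collecting this article does not mean I endorse its views; I simply find that its content provokes thought and discussion and has value of its own.

Reposted from: https://joji.me/en-us/blog/processing-huge-files-using-filereader-readasarraybuffer-in-web-browser/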

The FileReader API in HTML5 lets web browsers read user files locally without uploading them to a web server. This both lightens the load on the server and saves upload time. Using FileReader.readAsText() to process a 300 KB log file is easy. However, when the file grows to 1 GB or 2 GB, the page may crash or stop working correctly, because readAsText() loads the entire file into memory before any processing starts and the process exceeds its memory limit. To avoid this, a web application that processes huge files should cut the file into pieces with Blob.slice() and read each piece with FileReader.readAsArrayBuffer(), so that only part of the file is held in memory at any time.
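To make "only part of the file in memory" concrete, here is a minimal sketch of how a file of a given size could be divided into byte ranges for slicing. The helper name `chunkRanges` is hypothetical and not part of the original article or of any browser API.

```javascript
// Hypothetical helper: yields [start, end) byte ranges that together cover `size` bytes.
function* chunkRanges(size, chunkSize) {
  if (!(chunkSize > 0)) throw new RangeError('chunkSize must be positive');
  for (let start = 0; start < size; start += chunkSize) {
    // The last range is shortened so it never runs past the end of the file.
    yield [start, Math.min(start + chunkSize, size)];
  }
}

// A 25-byte file read in 10-byte chunks needs three ranges:
// [0, 10], [10, 20] and [20, 25].
console.log([...chunkRanges(25, 10)]);
```

In a browser, each pair would be passed to `file.slice(start, end)` and the resulting Blob read with `readAsArrayBuffer()`, so that at most `chunkSize` bytes are loaded per read.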

Test Scenario

Our test scenario is to use JavaScript to get the time range covered by a given IIS log on the local disk.

Sample IIS log:

#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2016-08-18 06:53:55
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
2016-08-18 06:53:55 ::1 GET / - 80 - ::1 Mozilla/5.0+(Windows+NT+10.0;+WOW64;+Trident/7.0;+rv:11.0)+like+Gecko - 200 0 0 476
2016-08-18 06:53:55 ::1 GET /iisstart.png - 80 - ::1 Mozilla/5.0+(Windows+NT+10.0;+WOW64;+Trident/7.0;+rv:11.0)+like+Gecko http://localhost/ 200 0 0 3
2016-08-18 08:45:34 10.172.19.198 GET /test/pac/wpad.dat - 80 - 10.157.21.235 Mozilla/5.0+(Windows+NT+6.1;+Win64;+x64;+Trident/7.0;+rv:11.0)+like+Gecko - 404 3 50 265
2016-08-18 08:46:44 10.172.19.198 GET /test/pac/wpad.dat - 80 - 10.157.21.235 Mozilla/5.0+(Windows+NT+6.1;+Win64;+x64;+Trident/7.0;+rv:11.0)+like+Gecko - 200 0 0 6

Our target is to get the time range of this IIS log:

Start time: 2016-08-18 06:53:55
End time: 2016-08-18 08:46:44
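The target above comes down to recognising lines that begin with a `YYYY-MM-DD HH:MM:SS` timestamp while skipping the `#` header lines. A minimal sketch follows, assuming the log has already been split into lines; the helper names `leadingTimestamp` and `timeRange` are hypothetical, and the sample lines are abbreviated versions of the log above.

```javascript
// Matches a "YYYY-MM-DD HH:MM:SS" timestamp at the very start of a line.
const TIMESTAMP_AT_START = /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/;

// Hypothetical helper: returns the leading timestamp of a line, or null if there is none.
function leadingTimestamp(line) {
  const match = TIMESTAMP_AT_START.exec(line);
  return match ? match[0] : null;
}

// Hypothetical helper: returns { start, end } for an array of log lines,
// or null if no line begins with a timestamp.
function timeRange(lines) {
  const stamps = lines.map(leadingTimestamp).filter((s) => s !== null);
  if (stamps.length === 0) return null;
  return { start: stamps[0], end: stamps[stamps.length - 1] };
}

// Abbreviated lines modelled on the sample log (fields shortened for readability).
const sampleLines = [
  '#Software: Microsoft Internet Information Services 10.0',
  '#Date: 2016-08-18 06:53:55',
  '2016-08-18 06:53:55 ::1 GET / - 80',
  '2016-08-18 08:46:44 10.172.19.198 GET /test/pac/wpad.dat - 80',
];
console.log(timeRange(sampleLines));
```

Because the pattern is anchored to the start of the line, the `#Date:` header is not mistaken for a log entry even though it contains a timestamp.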

Implementation using `readAsText()`

Implementing this with readAsText() is straightforward. Once the whole file has been read into a string, iterate over the lines from the beginning and take the first 19 characters of each line; if they match the date and time format, that string is the start time. The end time is found the same way by iterating over the lines from the end.

<input type="file" id="file" />
<button id="get-time">Get Time</button>
<script>
  document.getElementById('get-time').onclick = function () {
      let file = document.getElementById('file').files[0];
      if (!file) {
          alert('Please choose a log file first.');
          return;
      }
      let fr = new FileReader();
      fr.onload = function (e) {
          // e.target.result holds the entire file as one string.
          let startTime = getTime(e.target.result, false);
          let endTime = getTime(e.target.result, true);
          alert(`Log time range: ${startTime} ~ ${endTime}`);
      };
      fr.readAsText(file);
  };
  // Returns the first (or, if reverse is true, the last) timestamp found at the start of a line, or null.
  function getTime(text, reverse) {
      let timeReg = /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$/;
      let step = reverse ? -1 : 1;
      for (let i = reverse ? text.length - 1 : -1; i >= -1 && i < text.length; i += step) {
          // A line starts at index 0 (i === -1) or right after a line feed (char code 10).
          if (i === -1 || text.charCodeAt(i) === 10) {
              let snippet = text.slice(i + 1, i + 20);
              if (timeReg.test(snippet)) {
                  return snippet;
              }
          }
      }
      return null;
  }
</script>

The result of processing the sample IIS log (file size: 1 KB) matches our expectation.

However, the browser crashed when I chose a huge IIS log file (size: 2 GB), because readAsText() loaded the entire file into memory and the process exceeded its memory limit.

Implementation using `readAsArrayBuffer()`

In JavaScript, File inherits from Blob, so we can use Blob.slice() to cut the file into pieces for further processing. The workflow is as follows:

Get the first 10 KB of the file and decode it to text.
Iterate over the lines from the beginning of the text and check whether the first 19 characters of each line match the date and time format. If they do, that string is the start time.
Get the last 10 KB of the file and decode it to text.
In the same way, get the end time by iterating over the lines from the end.

Here is the code:

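<input type="file" id="file" />
<button id="get-time">Get Time</button>
<script>
  document.getElementById('get-time').onclick = function () {
      let file = document.getElementById('file').files[0];
      if (!file) {
          alert('Please choose a log file first.');
          return;
      }
      let fr = new FileReader();
      let CHUNK_SIZE = 10 * 1024;
      let decoder = new TextDecoder('utf-8');
      let timeReg = /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$/;
      let startTime, endTime;
      let reverse = false; // false: scan the head for the start time; true: scan the tail for the end time
      let chunkStart = 0;  // byte offset of the current chunk within the file
      fr.onload = function () {
          let buffer = new Uint8Array(fr.result);
          // Byte 0 of the chunk is a known line start only when the chunk begins at byte 0 of the file.
          let first = chunkStart === 0 ? -1 : 0;
          let step = reverse ? -1 : 1;
          let found = null;
          for (let i = reverse ? buffer.length - 1 : first; i >= first && i < buffer.length; i += step) {
              // A line starts at the beginning of the file (i === -1) or right after a line feed (byte 10).
              if (i === -1 || buffer[i] === 10) {
                  let snippet = decoder.decode(buffer.slice(i + 1, i + 20));
                  if (timeReg.test(snippet)) {
                      found = snippet;
                      break;
                  }
              }
          }
          if (!reverse) {
              startTime = found;
              reverse = true;
              seek(); // now read the tail of the file
          } else {
              endTime = found;
              alert(`Log time range: ${startTime} ~ ${endTime}`);
          }
      };
      seek();
      function seek() {
          // Clamp to 0 so that files smaller than CHUNK_SIZE are read whole.
          chunkStart = reverse ? Math.max(0, file.size - CHUNK_SIZE) : 0;
          let end = reverse ? file.size : CHUNK_SIZE;
          fr.readAsArrayBuffer(file.slice(chunkStart, end));
      }
  };
</script>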

With readAsArrayBuffer(), we now get the expected result very quickly, even for a 2 GB IIS log.
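As an alternative to the callback-based FileReader, the same head-and-tail approach can be sketched with promises. This is not the original author's method: it assumes a runtime that supports `Blob.prototype.arrayBuffer()`, and the names `timestampAt`, `findTimestamp`, and `getTimeRange` are hypothetical.

```javascript
const TIMESTAMP = /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$/;
const NEWLINE = 10;          // byte value of "\n"
const TIMESTAMP_LENGTH = 19; // length of "YYYY-MM-DD HH:MM:SS"
const decoder = new TextDecoder('utf-8');

// Decodes the 19 bytes starting at `offset`; returns them if they form a timestamp, else null.
function timestampAt(bytes, offset) {
  if (offset + TIMESTAMP_LENGTH > bytes.length) return null;
  const text = decoder.decode(bytes.subarray(offset, offset + TIMESTAMP_LENGTH));
  return TIMESTAMP.test(text) ? text : null;
}

// Checks every line start in `bytes`, front to back (or back to front when `reverse`
// is true), and returns the first timestamp found, or null.
// `startsAtLineBoundary` states whether byte 0 of `bytes` is known to begin a line.
function findTimestamp(bytes, reverse, startsAtLineBoundary) {
  const lineStarts = [];
  if (startsAtLineBoundary) lineStarts.push(0);
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] === NEWLINE) lineStarts.push(i + 1);
  }
  if (reverse) lineStarts.reverse();
  for (const offset of lineStarts) {
    const found = timestampAt(bytes, offset);
    if (found) return found;
  }
  return null;
}

// Reads only the first and last `chunkSize` bytes of `file` (a File or Blob).
async function getTimeRange(file, chunkSize = 10 * 1024) {
  const tailStart = Math.max(0, file.size - chunkSize);
  const [head, tail] = await Promise.all([
    file.slice(0, chunkSize).arrayBuffer(),
    file.slice(tailStart).arrayBuffer(),
  ]);
  return {
    start: findTimestamp(new Uint8Array(head), false, true),
    // The tail chunk begins on a line boundary only if it starts at byte 0 of the file.
    end: findTimestamp(new Uint8Array(tail), true, tailStart === 0),
  };
}
```

In the page above, the click handler could then call `getTimeRange(file).then(({ start, end }) => alert(`Log time range: ${start} ~ ${end}`))`. Either value is null when no timestamp appears within the chunk that was read, for example when a single log line is longer than `chunkSize`.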
