Content-Type vs Accept, HTTP Header

Sunday, October 05, 2014

This answer is the simplest one.

As you correctly note, the Accept header is used by HTTP clients to tell the server what content types they'll accept. The server will then send back a response, which will include a Content-Type header telling the client what the content type of the returned content actually is.

However, as you may have noticed, HTTP requests can also contain Content-Type headers. Why? Well, think about POST or PUT requests. With those request types, the client is actually sending a bunch of data to the server as part of the request, and the Content-Type header tells the server what the data actually is (and thus determines how the server will parse it).

In particular, for a typical POST request resulting from an HTML form submission, the Content-Type of the request will normally be either application/x-www-form-urlencoded or multipart/form-data.

In a request header, Accept states which content types the client wants to receive, while Content-Type records which content type is actually being sent.

If you think about PUT or POST, Content-Type can appear in the request header as well as the response header. The typical Content-Type values produced by an HTML form are application/x-www-form-urlencoded and multipart/form-data.

For the original text, see the Accept and Content-Type sections of RFC 2616.
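To make the two headers concrete, here is a minimal sketch using Python's standard urllib. The JSON payload and the q=0.8 weight are assumptions for illustration, and the URL is simply the form action that appears later in this post. Nothing is actually sent; the object only describes the request.

```python
import json
from urllib.request import Request

# Hypothetical JSON payload, made up for this example.
payload = json.dumps({"submit-name": "Larry"}).encode("utf-8")

req = Request(
    "http://server.com/cgi/handle",
    data=payload,
    method="POST",
    headers={
        # What the client is sending in the request body.
        "Content-Type": "application/json; charset=utf-8",
        # What the client would like to get back in the response.
        "Accept": "application/json, text/html;q=0.8",
    },
)

# urllib stores header names capitalized, hence "Content-type" here.
print(req.get_method())
print(req.get_header("Content-type"))
print(req.get_header("Accept"))
```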

Per the spec, application/x-www-form-urlencoded replaces spaces with +, separates key-value pairs with &, and so on, but it has one big problem: non-alphanumeric bytes are converted to %HH. Since that turns 1 byte into 3 bytes, the encoding becomes very inefficient whenever you send large files, binary data, or non-ASCII content.
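The overhead is easy to see with Python's urllib.parse. This is only a rough sketch: the field names and values are made up, and the 256-byte buffer just stands in for binary file content.

```python
from urllib.parse import quote_plus, urlencode

# Hypothetical form fields: spaces become "+", pairs are joined with "&",
# and a literal "&" inside a value has to be escaped as %26.
form_body = urlencode({"submit-name": "Larry Page", "note": "a&b"})
print(form_body)

# Every possible byte value once, standing in for binary file content.
raw = bytes(range(256))
encoded = quote_plus(raw)

# Most bytes are not alphanumeric, so most of them grow to three characters.
print(len(raw), len(encoded))
```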

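That is why multipart/form-data exists. It picks a special pattern that does not appear in the data, the boundary, and uses it to split each key-value pair into its own part. Each part carries a Content-Disposition header and, optionally, a Content-Type header, and the part is encoded according to the MIME type given there.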

<FORM action="http://server.com/cgi/handle"  
       enctype="multipart/form-data"
       method="post">
   <P>
   What is your name? <INPUT type="text" name="submit-name"><BR>
   What files are you sending? <INPUT type="file" name="files"><BR>
   <INPUT type="submit" value="Send"> <INPUT type="reset">
</FORM>

When a form like the one above is submitted, what is actually sent is:

  Content-Type: multipart/form-data; boundary=AaB03x

   --AaB03x
   Content-Disposition: form-data; name="submit-name"

   Larry
   --AaB03x
   Content-Disposition: form-data; name="files"; filename="file1.txt"
   Content-Type: text/plain

   ... contents of file1.txt ...
   --AaB03x--

If one more file is sent:

   Content-Type: multipart/form-data; boundary=AaB03x

   --AaB03x
   Content-Disposition: form-data; name="submit-name"

   Larry
   --AaB03x
   Content-Disposition: form-data; name="files"
   Content-Type: multipart/mixed; boundary=BbC04y

   --BbC04y
   Content-Disposition: file; filename="file1.txt"
   Content-Type: text/plain

   ... contents of file1.txt ...
   --BbC04y
   Content-Disposition: file; filename="file2.gif"
   Content-Type: image/gif
   Content-Transfer-Encoding: binary

   ...contents of file2.gif...
   --BbC04y--
   --AaB03x--
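As a rough sketch of how such a body could be assembled by hand in Python: the helper name build_multipart and its argument shapes are my own, not from any spec or library, and for simplicity it puts each file in its own top-level part instead of nesting multipart/mixed as in the example above. It also does not check that the boundary is absent from the data, which a real implementation must do.

```python
def build_multipart(fields, files, boundary="AaB03x"):
    """Build a multipart/form-data body (hypothetical helper).

    fields: dict of name -> text value
    files:  list of (name, filename, content_type, data_bytes)
    The caller must pick a boundary that does not occur in the data.
    """
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields.items():
        lines += [
            delimiter,
            f'Content-Disposition: form-data; name="{name}"'.encode("utf-8"),
            b"",  # blank line separates part headers from part body
            value.encode("utf-8"),
        ]
    for name, filename, content_type, data in files:
        disposition = (
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"'
        )
        lines += [
            delimiter,
            disposition.encode("utf-8"),
            f"Content-Type: {content_type}".encode("ascii"),
            b"",
            data,  # raw bytes, no %HH escaping needed
        ]
    # The closing delimiter has two extra dashes at the end.
    lines += [delimiter + b"--", b""]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


body, content_type = build_multipart(
    {"submit-name": "Larry"},
    [("files", "file1.txt", "text/plain", b"... contents of file1.txt ...")],
)
print(content_type)
print(body.decode("utf-8"))
```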

That does not mean multipart/form-data is always better. For simple alphanumeric data, you can skip the work of finding a boundary and MIME-encoding and decoding each part, and just send it as application/x-www-form-urlencoded.

As a side note, when the Content-Type is a text type, the charset parameter can be used to specify the character encoding, as in text/plain; charset=utf-8.
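One way to pull the charset out of such a header value in Python is the standard email.message.Message class, which understands MIME-style parameters. This is just a sketch; the sample body and the fallback to utf-8 when no charset is given are assumptions of mine.

```python
from email.message import Message

msg = Message()
msg["Content-Type"] = "text/plain; charset=utf-8"

media_type = msg.get_content_type()    # the type without parameters
charset = msg.get_content_charset()    # the charset parameter, lowercased

# Hypothetical body bytes; fall back to utf-8 if no charset was declared.
body = "héllo".encode("utf-8")
text = body.decode(charset or "utf-8")
print(media_type, charset, text)
```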

According to the SO answer, besides these two content types you can also send binary or non-ASCII data using application/xml or application/json.

application/octet-stream

Among the MIME types that can go in Content-Type, there is application/octet-stream, a value meant for arbitrary binary data. So

Content-Type: application/octet-stream  
Content-Disposition: attachment; filename="picture.png"

means "I don't know what this is, but the file name is picture.png, so please save it."

Content-Type: image/png  
Content-Disposition: attachment; filename="picture.png"

These headers mean "this is a PNG, please save it under the name picture.png."

Content-Type: image/png  
Content-Disposition: inline; filename="picture.png"

These headers mean "this is a PNG, please display it if you know how."
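A receiver could read these two headers with Python's email.message.Message. The describe helper below is a made-up name for illustration, not part of any library.

```python
from email.message import Message


def describe(content_type, content_disposition):
    """Return (media type, disposition, suggested filename). Hypothetical helper."""
    msg = Message()
    msg["Content-Type"] = content_type
    msg["Content-Disposition"] = content_disposition
    return (
        msg.get_content_type(),
        msg.get_content_disposition(),  # "attachment" or "inline"
        msg.get_filename(),             # the filename parameter, if any
    )


print(describe("application/octet-stream", 'attachment; filename="picture.png"'))
print(describe("image/png", 'inline; filename="picture.png"'))
```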

Content-Encoding

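So what is Content-Encoding? According to HTTP Compression, the client lists the encodings it is willing to receive in Accept-Encoding, and the server states the one it actually applied in Content-Encoding. This is not a character encoding but a compression method such as gzip, bzip2, or deflate.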

First, if the client sends a request like this,

GET /encrypted-area HTTP/1.1  
Host: www.example.com  
Accept-Encoding: gzip, deflate

the server can respond like this.

HTTP/1.1 200 OK  
Date: Sat, 04 Oct 2014 22:38:34 GMT  
Server: Apache/1.3.3.7 (Unix)  (Red-Hat/Linux)  
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT  
Accept-Ranges: bytes  
Content-Length: 438  
Connection: close  
Content-Type: text/html; charset=UTF-8  
Content-Encoding: gzip

The client then checks the Content-Encoding field and decompresses the body accordingly.
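On the client side this could look roughly like the following Python sketch. The decode_body function and the plain dict of headers are assumptions for illustration, and only gzip and deflate are handled.

```python
import gzip
import zlib


def decode_body(headers, body):
    """Undo the Content-Encoding named in the response headers (sketch)."""
    encoding = headers.get("Content-Encoding", "identity").strip().lower()
    if encoding == "identity":
        return body  # nothing was applied
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        # HTTP "deflate" is normally the zlib-wrapped format.
        return zlib.decompress(body)
    raise ValueError(f"unsupported Content-Encoding: {encoding}")


html = b"<html><body>hello</body></html>"
response_headers = {"Content-Type": "text/html; charset=UTF-8",
                    "Content-Encoding": "gzip"}
print(decode_body(response_headers, gzip.compress(html)))
```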

Transfer-Encoding

Unlike Content-Encoding, which is end-to-end, Transfer-Encoding is hop-by-hop. Quoting the SO answer as is:

Transfer-Encoding is hop-by-hop, while Content-Encoding is end-to-end.

This means that if there is a proxy involved, anywhere, the proxy will see the TE gzip, unzip it, and not necessarily forward the request as TE gzip.

So, the choices are

CE gzip and always know what you will be getting, requiring logic to decompress the response.

TE gzip and never know what you will be getting requiring logic to decide whether to decompress the response and the logic to decompress it when required.

The logical choice is CE gzip.

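In short, with CE (Content-Encoding) you know the response will arrive gzipped (for example), so you only need the logic to decompress it. With TE (Transfer-Encoding), if there is a proxy along the way that can undo gzip, the response may or may not arrive decompressed, so you need both the logic to decompress and the logic to decide whether decompression is needed at all.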

Below is a comment written by Roy T. Fielding, one of the authors of RFC 2616 (HTTP 1.1):

changing content-encoding on the fly in an inconsistent manner (neither "never" nor "always") makes it impossible for later requests regarding that content (e.g., PUT or conditional GET) to be handled correctly. This is, of course, why performing on-the-fly content-encoding is a stupid idea, and why I added Transfer-Encoding to HTTP as the proper way to do on-the-fly encoding without changing the resource.

Transfer-Encoding: Chunked

According to Chunked transfer encoding, there is a way to send an HTTP response in several pieces. Instead of Content-Length, the sender sets Transfer-Encoding to chunked and keeps sending chunks until a final (empty) chunk ends the body.

For example, suppose the server sends the following data (lines come in pairs: the first is the length of the data in hexadecimal, the second is the actual data chunk).

4\r\n  
Wiki\r\n  
5\r\n  
pedia\r\n  
e\r\n  
 in\r\n\r\nchunks.\r\n
0\r\n  
\r\n

The client interprets it as follows. Each chunk ends with a CRLF that is not counted in its length. The final chunk has length 0 and is followed only by \r\n.

Wikipedia in

chunks.
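The decoding rule can be sketched in a few lines of Python. decode_chunked is a name I made up, it works on a complete buffer rather than a socket, and it ignores trailers after the final chunk.

```python
def decode_chunked(data: bytes) -> bytes:
    """Decode a complete chunked body held in memory (sketch)."""
    body = bytearray()
    pos = 0
    while True:
        # The size line is hexadecimal and ends with CRLF.
        line_end = data.index(b"\r\n", pos)
        size_text = data[pos:line_end].split(b";", 1)[0]  # drop chunk extensions
        size = int(size_text.decode("ascii").strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # final (empty) chunk
        body += data[pos:pos + size]
        # The CRLF after the data is not part of the chunk length.
        if data[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("chunk is not terminated by CRLF")
        pos += size + 2
    return bytes(body)


wire = b"4\r\nWiki\r\n5\r\npedia\r\ne\r\n in\r\n\r\nchunks.\r\n0\r\n\r\n"
print(decode_chunked(wire).decode("ascii"))
```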

That page gives a few reasons for using it:

(1) it allows an HTTP persistent connection to be kept open,
(2) additional headers can be appended at the end, and
(3) it can be used together with Content-Encoding.

Summary

We took a quick look at HTTP headers. Next, let's take a look at HTTP 2.0 and SPDY too.

References

(1) HTML Form Specification
(2) form-data vs x-www-urlencoded
(3) application/octet-stream
(4) why-content-encoding-gzip-rather-than-transfer-encoding-gzip
(5) Transfer-Encoding vs Content-Encoding

Old Lisper

Lisp, Emacs, Scala